Methods, apparatuses, and media for performing association analysis for a business of interest

By constructing graph data and generating entity sequences through random walk sampling, association rules are determined, solving the problem of reliance on manual judgment in heterogeneous entity analysis and achieving efficient and accurate association analysis.

CN116415664B · Active · Publication Date: 2026-06-19 · SHENGDOUSHI SHANGHAI SCI & TECH DEV CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENGDOUSHI SHANGHAI SCI & TECH DEV CO LTD
Filing Date
2021-12-31
Publication Date
2026-06-19


Abstract

Embodiments of this disclosure relate to methods, apparatuses, and media for performing association analysis for a business of interest. According to the method, graph data is constructed based on multiple relationships between multiple entities concerning the business of interest. The graph data includes multiple nodes and multiple edges, where the nodes respectively correspond to the multiple entities and the edges respectively correspond to the multiple relationships. Random walk sampling is performed on the graph data, subject to one or more prior conditions that constrain the walk sampling path, to obtain multiple entity sequences. One or more association rules for the business of interest are determined based on the multiple entity sequences, the association rules representing the association relationships between related entities. This enables association analysis of heterogeneous entities associated with the business of interest, reduces subjective dependence, and lowers labor and trial-and-error costs.

Description

Technical Field

[0001] The embodiments of this disclosure generally relate to the field of artificial intelligence, and more specifically to a method, apparatus, and medium for performing association analysis for a business of interest.

Background Technology

[0002] In daily business analysis, in addition to the relationships between homogeneous entities of the same type, it is often necessary to explore the relationships between a large number of heterogeneous entities of different types, such as the relationship between consumers' offline purchasing behavior and online browsing behavior, and the relationship between consumer satisfaction and product production and service processes. Currently, the analysis of relationships between heterogeneous entities is usually carried out by data analysts relying on their subjective judgment and a large number of trials, which is time-consuming, labor-intensive, and has high human resource and trial-and-error costs.

[0003] Therefore, it is necessary to provide a method for automatically analyzing the relationships between heterogeneous entities, which can reduce subjective dependence, reduce labor costs and trial-and-error costs, and thus improve the efficiency and accuracy of analysis.

Summary of the Invention

[0004] To address the aforementioned issues, this disclosure provides a method and apparatus for performing association analysis for a business of interest, enabling association analysis of heterogeneous entities, reducing subjective dependence, lowering labor and trial-and-error costs, and thereby improving analysis efficiency and accuracy.

[0005] According to a first aspect of this disclosure, a method for performing association analysis for a business of interest is provided, comprising: constructing graph data based on multiple relationships between multiple entities of the business of interest, the graph data including multiple nodes and multiple edges, the multiple nodes respectively corresponding to the multiple entities, and the multiple edges respectively corresponding to the multiple relationships; performing random walk sampling on the graph data based on one or more prior conditions for constraining walk sampling paths to obtain multiple entity sequences; and determining one or more association rules for the business of interest based on the multiple entity sequences, the association rules being used to represent the association relationships between related entities.

[0006] According to a second aspect of this disclosure, a computing device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of this disclosure.

[0007] In a third aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of the first aspect of this disclosure.

[0008] In some embodiments, the plurality of entities are divided into two categories: analytical entities and transitional entities. The analytical entities are those entities among the plurality of entities whose related relationships need to be determined, and the remaining entities among the plurality of entities other than the analytical entities are transitional entities.

[0009] In some embodiments, determining one or more association rules for the business of interest based on the plurality of entity sequences includes: filtering out entities that belong to transitional entities from the entity sequences to obtain filtered entity sequences; determining a plurality of frequent itemsets based on the set of all filtered entity sequences, wherein the support of the frequent itemsets is greater than or equal to a predetermined minimum support; and mining one or more association rules from the frequent itemsets, wherein the confidence of the association rules is greater than a predetermined minimum confidence.
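The three steps in paragraph [0009] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the sample sequences, the restriction to itemsets of size one and two, and the treatment of each sequence as an unordered set are all simplifications introduced here.

```python
from collections import Counter
from itertools import combinations

def entity_type(name):
    """Assumed naming scheme: 'Ad 1' -> 'Ad', 'Review Tag 1' -> 'Review Tag'."""
    return name.rsplit(" ", 1)[0]

def filter_transitional(sequences, transitional_types):
    """Drop transitional entities so only analytical entities remain."""
    return [[e for e in seq if entity_type(e) not in transitional_types]
            for seq in sequences]

def frequent_itemsets(sequences, min_support, max_size=2):
    """Map each itemset (sorted tuple) to its support, keeping support >= min_support."""
    counts = Counter()
    for seq in sequences:
        items = sorted(set(seq))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(sequences)
    return {k: c / total for k, c in counts.items() if c / total >= min_support}

def mine_rules(itemsets, min_confidence):
    """Derive rules a -> b from frequent pairs, keeping confidence > min_confidence."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) != 2:
            continue
        for a, b in (itemset, itemset[::-1]):
            confidence = support / itemsets[(a,)]
            if confidence > min_confidence:
                rules.append((a, b, confidence, support))
    return rules

# Hypothetical sampled entity sequences (names modeled on Figure 3).
SEQUENCES = [
    ["Element 1", "Ad 1", "User 1", "Transaction 1", "Product 1"],
    ["Element 1", "Ad 1", "User 1", "Transaction 1", "Product 1"],
    ["Element 1", "Ad 2", "User 2", "Transaction 2", "Product 2"],
    ["Element 2", "Ad 3", "User 1", "Transaction 1", "Product 1"],
]
filtered = filter_transitional(SEQUENCES, {"Ad", "User", "Transaction"})
itemsets = frequent_itemsets(filtered, min_support=0.5)
rules = mine_rules(itemsets, min_confidence=0.6)
```

With these sample sequences, only Element 1, Product 1, and their pair reach the minimum support of 0.5, so the mined rules relate those two entities.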

[0010] In some embodiments, determining one or more association rules for the business of interest based on the plurality of entity sequences further includes: determining whether attribution analysis is needed for the association relationships between related entities; and in response to determining that attribution analysis is needed, filtering out association rules whose causal order does not meet the requirements from the one or more association rules mined.

[0011] In some embodiments, the one or more prior conditions include excluding walk sampling paths with spurious associations, where a spurious association refers to a walk sampling path traversing an entity sequence that includes more than a first threshold number of different transitional entities of the same entity type. Based on one or more prior conditions constraining the walk sampling paths, random walk sampling of the graph data includes: for the current step in a random walk sampling of the graph data in the current round, selecting one neighbor node from a plurality of neighbor nodes of the current sampling node as the next sampling node; in response to determining that the next sampling node belongs to a transitional entity of a first entity type, incrementing a first count associated with transitional entities of the first entity type; determining whether the first count is greater than or equal to the first threshold number; and in response to determining that the first count is greater than or equal to the first threshold number, excluding walk sampling paths traversed by the random walk sampling in the current round.

[0012] In some embodiments, the one or more prior conditions include that the walk sampling path must pass through at least a second threshold number of entities of a specific entity type. Based on one or more prior conditions constraining the walk sampling path, performing a random walk sampling on the graph data includes: for the current step in a random walk sampling of the graph data in the current round, selecting one neighbor node from a plurality of neighbor nodes of the current sampling node as the next sampling node; incrementing a third count in response to determining that the next sampling node belongs to the specific entity type; determining whether the third count is less than the second threshold number at the end of the random walk sampling of the current round; and excluding the walk sampling path traversed by the random walk sampling of the current round in response to determining that the third count is less than the second threshold number.

[0013] In some embodiments, the relationship includes an event relationship, and the edges of the graph data associated with the event relationship include corresponding timestamps, the timestamps indicating when the corresponding event relationship was formed.

[0014] In some embodiments, the one or more prior conditions include that the walk sampling path must traverse multiple edges in chronological order, and that performing random walk sampling on the graph data based on one or more prior conditions constraining the walk sampling path includes: for the current step of a random walk sampling in the current round of the graph data, selecting a neighbor node from multiple neighbor nodes of the current sampling node as the next sampling node; comparing a first timestamp with a second timestamp to determine whether the first timestamp is later than the second timestamp, wherein the first timestamp is the timestamp included in the edge between the next sampling node and the current sampling node, and the second timestamp is the timestamp included in the edge between the current sampling node and the previous sampling node; and excluding the walk sampling path traversed by the random walk sampling in the current round in response to determining that the first timestamp is not later than the second timestamp.

[0015] In some embodiments, selecting a node as the next sampling node from a plurality of neighboring nodes of the current sampling node includes: randomly selecting a neighboring node as the next sampling node from a plurality of neighboring nodes of the current sampling node, wherein each of the plurality of neighboring nodes has the same probability of being selected as the next sampling node.

[0016] In some embodiments, the multiple nodes of the graph data include weights, and selecting a node from multiple neighboring nodes of the current sampling node as the next sampling node includes: randomly selecting a neighboring node from the multiple neighboring nodes of the current sampling node as the next sampling node, wherein the probability of each neighboring node being selected as the next sampling node is positively correlated with the weight of that neighboring node.

[0017] In some embodiments, the multiple edges of the graph data include weights, and selecting a node from multiple neighboring nodes of the current sampling node as the next sampling node includes: randomly selecting a neighboring node from the multiple neighboring nodes of the current sampling node as the next sampling node, wherein the probability of each neighboring node being selected as the next sampling node is positively correlated with the weight of the edge between that neighboring node and the current sampling node.

[0018] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0019] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements.

[0020] Figure 1 shows a schematic diagram of a system 100 for implementing a method for performing association analysis for a business of interest according to an embodiment of the present disclosure.

[0021] Figure 2 shows a flowchart of a method 200 for performing association analysis for a business of interest according to an embodiment of the present disclosure.

[0022] Figure 3 shows a schematic diagram of exemplary graph data 300 according to an embodiment of the present disclosure.

[0023] Figure 4 shows a flowchart of a method 400 for determining one or more association rules for a business of interest based on a plurality of entity sequences, according to an embodiment of the present disclosure.

[0024] Figure 5 shows a block diagram of an electronic device 500 according to an embodiment of the present disclosure.

Detailed Description

[0025] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0026] The term "comprising" and its variations as used herein signify open inclusion, i.e., "including but not limited to". Unless otherwise stated, the term "or" means "and / or". The term "based on" means "at least partially based on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0027] As mentioned above, in daily business analysis, in addition to the relationships between homogeneous entities of the same type, it is often necessary to explore the relationships between a large number of heterogeneous entities of different types, such as the relationship between consumers' offline purchasing behavior and online browsing behavior, and the relationship between consumer satisfaction and product production and service processes. Currently, the analysis of relationships between heterogeneous entities is usually carried out by data analysts relying on their subjective judgment and a large number of trials, which is time-consuming, labor-intensive, and has high human resource and trial-and-error costs.

[0028] To at least partially address one or more of the aforementioned problems and other potential issues, exemplary embodiments of this disclosure propose a method for heterogeneous entity association analysis, comprising: constructing graph data based on multiple relationships between multiple entities concerning a business of interest, the graph data including multiple nodes and multiple edges, the multiple nodes corresponding to the multiple entities and the multiple edges corresponding to the multiple relationships; performing random walk sampling on the graph data based on one or more prior conditions constraining the walk sampling path to obtain multiple entity sequences; and determining one or more association rules for the business of interest based on the multiple entity sequences, the association rules representing the association relationships between related entities. This approach reduces subjective dependence, lowers labor costs and trial-and-error costs, thereby improving analytical efficiency and accuracy.

[0029] Figure 1 shows a schematic diagram of a system 100 for implementing a method for performing association analysis for a business of interest according to an embodiment of the present disclosure. As shown in Figure 1, system 100 includes, for example, a computing device 110, multiple user terminals 120-1, 120-M to 120-N, and a network 140. The computing device 110 can interact with the multiple user terminals 120-1, 120-M to 120-N via the network 140.

[0030] The computing device 110 includes, but is not limited to, server computers, multiprocessor systems, mainframe computers, and distributed computing environments that include any of the aforementioned systems or devices. In some embodiments, the computing device 110 may have one or more processing units, including dedicated processing units such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), as well as general-purpose processing units such as central processing units (CPUs).

[0031] The computing device 110 can be used, for example, for heterogeneous entity association analysis. User terminals 120-1, 120-M to 120-N, including but not limited to users' mobile phones and computers, can be used to select desired association rules from the association rules determined by the computing device 110. Each association rule represents the relationship between related entities. For example, user terminals 120-1 to 120-N can view desired association rules based on the confidence and support of each association rule determined by the computing device 110. User terminals 120-1, 120-M to 120-N can also select and view desired association rules based on the characteristics of the business they are interested in, or they can view association rules related to several entities of particular interest.

[0032] Figure 2 shows a flowchart of a method 200 for performing association analysis for a business of interest according to an embodiment of this disclosure. Method 200 may be performed by, for example, the computing device 110 shown in Figure 1, or by the electronic device 500 shown in Figure 5. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks that are shown, and the scope of this disclosure is not limited in this respect.

[0033] In step 202, computing device 110 constructs graph data based on multiple relationships between multiple entities concerning the business of interest. In this disclosure, graph data is structured data that stores corresponding data in the form of a graph. Specifically, graph data may include multiple nodes and multiple edges, where the multiple nodes correspond to the multiple entities and the multiple edges correspond to the multiple relationships. In this disclosure, these relationships may be determined by the physical attributes of the corresponding entities.

[0034] In this disclosure, the multiple entities relating to the business of interest may include heterogeneous entities of different types. For example, Figure 3 shows an example of graph data 300, which includes nodes such as Element 1, Ad 1, User 1, Transaction 1, Review Tag 1, and Product 1, where Element, Ad, User, Transaction, Review Tag, and Product are all heterogeneous entities. Graph data 300 also includes the inclusion relationship between Element 1 and Ad 1, the click relationship between Ad 1 and User 1, the purchase relationship between User 1 and Transaction 1, the inclusion relationship between Transaction 1 and Product 1, the review content relationship between Review Tag 1 and Transaction 1, and so on. In some embodiments, relationships include event relationships (or factual relationships) and other types of relationships. Event relationships are relationships formed between corresponding entities through events or behaviors. For example, the relationships included in the graph data 300 shown in Figure 3 are all event relationships, but it should be understood that relationships between entities can also be of other types. In this disclosure, the edges of the graph data associated with event relationships may include corresponding timestamps indicating when the corresponding event relationship was formed.
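As a rough illustration of the graph data described for Figure 3, the sketch below stores the five named relationships as timestamped edges in an adjacency map. The relation labels, the integer timestamps, the helper names, and the choice to make edges traversable in both directions are assumptions made for this example; the disclosure does not prescribe a storage format.

```python
from collections import defaultdict

# Hypothetical edge records: (source, relation, target, timestamp).
# Entity names follow Figure 3; relation labels and timestamps are invented.
RELATIONS = [
    ("Element 1", "included_in", "Ad 1", 1),
    ("Ad 1", "clicked_by", "User 1", 2),
    ("User 1", "purchased", "Transaction 1", 3),
    ("Transaction 1", "includes", "Product 1", 4),
    ("Review Tag 1", "reviews", "Transaction 1", 5),
]

def entity_type(name):
    """Assumed naming scheme: 'Ad 1' -> 'Ad', 'Review Tag 1' -> 'Review Tag'."""
    return name.rsplit(" ", 1)[0]

def build_graph(relations):
    """Return node -> list of (neighbor, relation, timestamp).

    Each relationship is stored in both directions (an assumption) so that
    a walk can cross an edge from either endpoint.
    """
    graph = defaultdict(list)
    for src, rel, dst, ts in relations:
        graph[src].append((dst, rel, ts))
        graph[dst].append((src, rel, ts))
    return dict(graph)

graph = build_graph(RELATIONS)
```

Here each node is one entity and each edge one relationship, with the timestamp carried on the edge as the paragraph describes for event relationships.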

[0035] In some embodiments, the multiple entities in the graph data can be divided into two categories: analytical entities and transitional entities. Analytical entities are those entities in the graph data whose association relationships need to be determined, while the remaining entities in the graph data are transitional entities. In this disclosure, analytical entities are entities of particular interest to the business of interest; for example, in the graph data 300 shown in Figure 3, elements, products, and review tags can be pre-specified as analytical entities. Transitional entities are not themselves objects of analysis for the business of interest; they merely serve a connecting role. During random walk sampling, transitional entities are those located between two analytical entities on the walk sampling path. For example, in the graph data 300 shown in Figure 3, ads, users, and transactions can be pre-specified as transitional entities.

[0036] In step 204, random walk sampling is performed on the graph data based on one or more prior conditions used to constrain the walk sampling path to obtain multiple entity sequences.

[0037] In this disclosure, one or more prior conditions may be specified in advance. These prior conditions may include excluding walk sampling paths with spurious associations, requiring walk sampling paths to pass through at least a second threshold number of entities of a specific entity type, or requiring walk sampling paths to pass through multiple edges in chronological order, etc.

[0038] Random walk sampling refers to the process of randomly selecting a node from the neighboring nodes of a given node as the next hop node. Repeated walk processes can generate multiple walk sequences, which are referred to as entity sequences in this disclosure.
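A minimal sketch of unconstrained random walk sampling, as defined in paragraph [0038], might look like the following. The adjacency map, the fixed number of steps per round, and the function names are assumptions for illustration only.

```python
import random

# Hypothetical adjacency map; node names follow Figure 3.
GRAPH = {
    "Element 1": ["Ad 1", "Ad 2"],
    "Ad 1": ["Element 1", "User 1"],
    "Ad 2": ["Element 1"],
    "User 1": ["Ad 1", "Transaction 1"],
    "Transaction 1": ["User 1", "Product 1"],
    "Product 1": ["Transaction 1"],
}

def random_walk(graph, start, num_steps, rng):
    """One round: starting at `start`, hop to a random neighbor `num_steps` times."""
    path = [start]
    for _ in range(num_steps):
        neighbors = graph[path[-1]]
        if not neighbors:
            break  # dead end: stop this round early
        path.append(rng.choice(neighbors))
    return path

def sample_sequences(graph, start, num_steps, num_rounds, seed=0):
    """Repeat the walk for several rounds to collect multiple entity sequences."""
    rng = random.Random(seed)
    return [random_walk(graph, start, num_steps, rng) for _ in range(num_rounds)]

sequences = sample_sequences(GRAPH, "Element 1", num_steps=4, num_rounds=5)
```

Each returned list is one entity sequence; the prior conditions described in the following paragraphs would then decide which rounds are kept.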

[0039] For brevity, step 204 is described in more detail below using examples in which a single prior condition is specified. However, it should be understood that the following examples are merely exemplary. In practical applications, a single prior condition can be specified in advance as needed, or multiple prior conditions can be specified in advance so that the walk sampling path is constrained by multiple prior conditions at the same time.

[0040] In some embodiments, a prior condition for excluding walk sampling paths with spurious associations can be specified in advance. In this disclosure, a spurious association means that the entity sequence traversed by a single walk sampling path contains more than a first threshold number of different transitional entities of the same entity type. Here, the entity type indicates the type of an entity and may include, for example, product, transaction, ad, user, element, and review tag. The first threshold number can be set as needed; for example, it can be 2. In these embodiments, step 204 may further include the following sub-steps. First, for the current step of the current round of random walk sampling on the graph data, one neighbor node is selected from the multiple neighbor nodes of the current sampling node as the next sampling node. In this disclosure, a neighbor node of the current sampling node is a node that is adjacent to the current sampling node and has an association relationship with it; for example, in the example shown in Figure 3, the neighbor nodes of Element 1 include Ad 1 and Ad 2. Then, in response to determining that the next sampling node belongs to a transitional entity of a first entity type, a first count associated with transitional entities of the first entity type is incremented (e.g., by 1). Subsequently, it is determined whether the first count is greater than or equal to the first threshold number, and in response to determining that it is, the walk sampling path traversed by the current round of random walk sampling is excluded. It should be understood that if the next sampling node is determined to belong to a transitional entity of a second entity type, a second count associated with transitional entities of the second entity type is incremented (e.g., by 1). It is then determined whether the second count is greater than or equal to the first threshold number, and in response to determining that it is, the walk sampling path traversed by the current round of random walk sampling is excluded. Transitional entities of other entity types can be handled similarly. On the other hand, in response to determining that the first count (or the second count, and so on) is less than the first threshold number, the preceding steps are repeated with the selected next sampling node as the current sampling node until the current round of random walk sampling is completed. In this disclosure, the walk sampling process that determines one walk sampling path is referred to as one round of random walk sampling. For example, in Figure 3, the walk sampling path "Element 1 → Ad 1 → User 1 → Transaction 1 → Product 1" is determined through one round of random walk sampling. Furthermore, the process of moving from one sampling node to the next is referred to as one step of sampling.
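The per-type counting in paragraph [0040] can be sketched as below. This is an illustration under assumptions: the two small directed chains, the set of transitional types, and the function names are hypothetical, and the count is incremented on every visit to a transitional entity, following the sub-steps as written.

```python
import random
from collections import Counter

TRANSITIONAL_TYPES = {"Ad", "User", "Transaction"}  # per the Figure 3 example

def entity_type(name):
    """Assumed naming scheme: 'Ad 1' -> 'Ad'."""
    return name.rsplit(" ", 1)[0]

def walk_without_spurious(graph, start, num_steps, first_threshold, rng):
    """Return the walked path, or None if the round is excluded as spurious."""
    counts = Counter()  # one count per transitional entity type
    path = [start]
    for _ in range(num_steps):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        nxt = rng.choice(neighbors)
        path.append(nxt)
        etype = entity_type(nxt)
        if etype in TRANSITIONAL_TYPES:
            counts[etype] += 1
            if counts[etype] >= first_threshold:
                return None  # exclude the whole round
    return path

# Hypothetical single-successor chains, so the outcome does not depend on chance.
CHAIN_OK = {"Element 1": ["Ad 1"], "Ad 1": ["User 1"],
            "User 1": ["Transaction 1"], "Transaction 1": ["Product 1"]}
CHAIN_SPURIOUS = {"Element 1": ["Ad 1"], "Ad 1": ["User 1"],
                  "User 1": ["Ad 2"], "Ad 2": ["Product 1"]}

rng = random.Random(0)
kept = walk_without_spurious(CHAIN_OK, "Element 1", 4, 2, rng)
dropped = walk_without_spurious(CHAIN_SPURIOUS, "Element 1", 4, 2, rng)
```

With a first threshold of 2, the first chain visits each transitional type once and is kept, while the second reaches a second ad and is excluded.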

[0041] It should be understood that, in this disclosure, it is also possible to determine whether the count of transitional entities is greater than or equal to the first threshold number after the random walk sampling of the current round is completed, so as to determine whether to exclude the walk sampling path traversed by the random walk sampling of the current round.

[0042] In this disclosure, the number or maximum number of entities that the walk sampling path must traverse can also be specified in advance (i.e., the sequence length of the entity sequence or the maximum sequence length of the entity sequence). If the sequence length is specified in advance, at the end of the current round of random walk sampling, it is necessary to determine whether the length of the sequence of entities traversed by the walk sampling path is equal to the specified sequence length. If it is determined that the length of the sequence of entities traversed is not equal to the specified sequence length, the walk sampling path traversed by the current round of random walk sampling is excluded.

[0043] If a maximum sequence length is specified in advance, at the end of the random walk sampling in the current round, it is also necessary to determine whether the length of the sequence of entities traversed by the walk sampling path is less than or equal to the maximum sequence length. In response to determining that the length of the sequence of entities traversed is greater than the maximum sequence length, the walk sampling path traversed by the random walk sampling in the current round is excluded.
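The two length checks in paragraphs [0042] and [0043] amount to a simple filter applied at the end of a round. The function name and its keyword parameters below are hypothetical; this is only a sketch of the stated conditions.

```python
def passes_length_check(sequence, exact_length=None, max_length=None):
    """Return False if the round should be excluded on length grounds."""
    if exact_length is not None and len(sequence) != exact_length:
        return False  # a fixed sequence length was specified and not met
    if max_length is not None and len(sequence) > max_length:
        return False  # the sequence exceeds the specified maximum length
    return True

path = ["Element 1", "Ad 1", "User 1", "Transaction 1", "Product 1"]
keep_exact = passes_length_check(path, exact_length=5)
keep_max = passes_length_check(path, max_length=4)
```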

[0044] In some embodiments, a prior condition may be specified in advance that the walk sampling path must traverse at least a second threshold number of entities of a specific entity type. In this disclosure, the second threshold number may also be set as needed; for example, it may be 2. In these embodiments, step 204 may further include the following sub-steps. First, for the current step of the current round of random walk sampling on the graph data, a neighbor node is selected from the plurality of neighbor nodes of the current sampling node as the next sampling node. Then, in response to determining that the next sampling node belongs to the specific entity type, a third count is incremented (e.g., by 1). In this disclosure, the specific entity type indicates objects of interest. For example, analytical entities may be defined as the specific entity type of interest. Of course, other entity types, such as products, may also be designated as specific entity types of interest. Subsequently, at the end of the current round of random walk sampling, it is determined whether the third count is less than the second threshold number, and in response to determining that it is, the walk sampling path traversed by the current round of random walk sampling is excluded. On the other hand, in response to determining that the third count is greater than or equal to the second threshold number, the walk sampling path traversed by the current round of random walk sampling is retained and the corresponding entity sequence is determined, the entity sequence being the sequence of entities traversed by the walk sampling path.
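The end-of-round check in paragraph [0044] can be sketched as follows. The helper names and sample paths are hypothetical. One labeled assumption: the starting node is counted along with the nodes reached by each step, which the paragraph does not state explicitly.

```python
def entity_type(name):
    """Assumed naming scheme: 'Review Tag 1' -> 'Review Tag'."""
    return name.rsplit(" ", 1)[0]

def meets_minimum_count(path, specific_types, second_threshold):
    """Keep the round only if enough entities of the specific types were traversed.

    Assumption: the starting node counts toward the total.
    """
    third_count = sum(1 for node in path if entity_type(node) in specific_types)
    return third_count >= second_threshold

ANALYTICAL_TYPES = {"Element", "Product", "Review Tag"}  # per the Figure 3 example

full_path = ["Element 1", "Ad 1", "User 1", "Transaction 1", "Product 1"]
short_path = ["Element 1", "Ad 1", "User 1"]
keep_full = meets_minimum_count(full_path, ANALYTICAL_TYPES, 2)
keep_short = meets_minimum_count(short_path, ANALYTICAL_TYPES, 2)
```

With a second threshold of 2, the full path links two analytical entities and is kept, while the short path contains only one and is excluded.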

[0045] In some embodiments, a prior condition can be specified in advance that the walk sampling path must traverse multiple edges in chronological order. In these embodiments, the edges of the graph data associated with event relationships should be marked with corresponding timestamps. This prior condition therefore specifies that the walk sampling path must traverse the edges in the chronological order of their timestamps, i.e., the order in which the walk sampling path traverses the edges must follow increasing timestamps. Furthermore, in these embodiments, step 204 may further include the following sub-steps. First, for the current step of the current round of random walk sampling on the graph data, one neighbor node is selected from the multiple neighbor nodes of the current sampling node as the next sampling node. Then, a first timestamp is compared with a second timestamp to determine whether the first timestamp is later than the second timestamp, where the first timestamp is the timestamp of the edge between the next sampling node and the current sampling node, and the second timestamp is the timestamp of the edge between the current sampling node and the previous sampling node. Subsequently, in response to determining that the first timestamp is not later than the second timestamp, the walk sampling path traversed by the current round of random walk sampling is excluded. On the other hand, in response to determining that the first timestamp is later than the second timestamp, the preceding steps are repeated with the selected next sampling node as the current sampling node until the current round of random walk sampling is completed.
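A minimal sketch of the timestamp comparison in paragraph [0045] is given below. The graph layout (each neighbor paired with the timestamp of the connecting edge), the integer timestamps, and the function name are assumptions for illustration.

```python
import random

def chronological_walk(graph, start, num_steps, rng):
    """graph: node -> list of (neighbor, edge timestamp).

    Return the walked path, or None if an edge is not later than the previous one.
    """
    path = [start]
    prev_ts = None  # timestamp of the edge used to reach the current node
    for _ in range(num_steps):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        nxt, ts = rng.choice(neighbors)
        if prev_ts is not None and ts <= prev_ts:
            return None  # first timestamp not later than second: exclude the round
        path.append(nxt)
        prev_ts = ts
    return path

# Hypothetical single-successor chains with invented timestamps.
IN_ORDER = {"Element 1": [("Ad 1", 1)], "Ad 1": [("User 1", 2)],
            "User 1": [("Transaction 1", 3)]}
OUT_OF_ORDER = {"Element 1": [("Ad 1", 5)], "Ad 1": [("User 1", 2)]}

rng = random.Random(0)
kept = chronological_walk(IN_ORDER, "Element 1", 3, rng)
dropped = chronological_walk(OUT_OF_ORDER, "Element 1", 3, rng)
```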

[0046] In some embodiments, the aforementioned selection of a neighboring node from multiple neighboring nodes of the current sampling node as the next sampling node may include randomly selecting a neighboring node from multiple neighboring nodes of the current sampling node as the next sampling node, wherein each of these neighboring nodes has an equal probability of being selected as the next sampling node.

[0047] In some other embodiments, the nodes included in the graph data may include weights. By default, the weight of each node in the graph data is 1, but different weights can be assigned to nodes associated with entities of different levels of interest, depending on the needs of the business of interest. For example, nodes associated with entities of higher interest can be assigned relatively higher weights, while nodes associated with entities of lower interest can be assigned relatively lower weights. In these embodiments, selecting a node as the next sampling node from the multiple neighbor nodes of the current sampling node includes randomly selecting a neighbor node from the multiple neighbor nodes of the current sampling node, wherein the probability of each neighbor node being selected as the next sampling node is positively correlated with the weight of that neighbor node. For example, during walk sampling, if the weight of the i-th neighbor node that can be walked to next is $w_i$, then the probability of reaching that neighbor node in the next step can be $w_i / \sum_j w_j$, that is, $w_i$ divided by the sum of the weights of all neighbor nodes. For example, in the example graph data 300 shown in Figure 3, if, among all review tags, the business of interest is most concerned with Review Tag 1 and wants to discover more association rules related to Review Tag 1, then the node of Review Tag 1 can be given a higher weight in the graph data 300, such as 1.5 (or any other weight greater than 1), while the other review tag nodes keep the default weight of 1. The probability of Review Tag 1 being visited will then be higher than that of the other review tag nodes.
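The weighted selection probability in paragraph [0047] can be worked through numerically. In the sketch below, the function name and the presence of exactly three review tag neighbors are assumptions; the weight of 1.5 for Review Tag 1 and the default weight of 1 come from the paragraph.

```python
def transition_probabilities(neighbors, node_weights, default_weight=1.0):
    """Probability of each neighbor: its weight divided by the sum of all neighbor weights."""
    total = sum(node_weights.get(n, default_weight) for n in neighbors)
    return {n: node_weights.get(n, default_weight) / total for n in neighbors}

# Hypothetical neighborhood: three review tags reachable from the current node.
neighbors = ["Review Tag 1", "Review Tag 2", "Review Tag 3"]
probs = transition_probabilities(neighbors, {"Review Tag 1": 1.5})
```

Under these assumptions the weights sum to 3.5, so Review Tag 1 is chosen with probability 1.5 / 3.5 and each of the other two with probability 1 / 3.5.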

[0048] In this disclosure, where multiple prior conditions are specified, the above embodiments can be combined to apply the corresponding constraints to the walk sampling path.

[0049] In some embodiments, the edges included in the graph data may include weights. By default, the weight of each edge in the graph data is 1, but different weights can be assigned to edges associated with entities of different levels of interest, depending on the needs of the business of interest. For example, edges associated with entities of higher interest may be assigned relatively higher weights, while edges associated with entities of lower interest may be assigned relatively lower weights. In these embodiments, selecting a neighbor node as the next sampling node from the multiple neighbor nodes of the current sampling node may include randomly selecting a neighbor node from the multiple neighbor nodes of the current sampling node, wherein the probability of each of these neighbor nodes being selected as the next sampling node is positively correlated with the weight of the edge between that neighbor node and the current sampling node.
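A minimal sketch of edge-weighted selection, assuming the graph is stored as a nested dictionary that maps each node to its neighbors and the weight of the connecting edge. This representation, the function name, and the example weights are illustrative assumptions rather than part of the disclosure.

```python
import random

def choose_next_by_edge_weight(graph, current, rng=random):
    """Pick a neighbor of `current` with probability proportional to edge weight."""
    neighbors = list(graph[current])
    weights = [graph[current][n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]

# Hypothetical weights: the edge to "User 1" is three times heavier than
# the edge to "Product 1".
graph = {"Transaction 1": {"Product 1": 1.0, "User 1": 3.0}}

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = {"Product 1": 0, "User 1": 0}
for _ in range(10_000):
    counts[choose_next_by_edge_weight(graph, "Transaction 1", rng)] += 1
```

Over many draws, "User 1" should be chosen roughly three times as often as "Product 1".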

[0050] In step 206, one or more association rules for the business of interest are determined based on multiple entity sequences (i.e., the multiple entity sequences determined in step 204). These association rules are used to represent the association relationships between related entities.

[0051] In this disclosure, the determined association rules can be ordered lists of corresponding entities, including corresponding confidence and support scores. The confidence score indicates the strength of the association, and the support score indicates the breadth of the association's coverage. For example, the example association rule "A→B, confidence a%, support b%" means that given entity A, the probability of entity B occurring is a%, and the probability of A appearing in the walk sampling path is b%. Therefore, association rules with higher confidence scores have more significant associations, and association rules with higher support scores have a greater impact on the business being studied.

[0052] Since an entity sequence is a sequence of entities determined according to the order of walk sampling, it is usually an ordered list. Therefore, in step 206, each entity sequence can first be degenerated into an unordered list, and then one or more association rules can be determined based on these unordered lists. Step 206 is described in further detail below with reference to Figure 4.

[0053] In this disclosure, after the entity sequences are determined, one or more association rules can be determined for the business of interest based on all of these entity sequences. The determined association rules can be presented as needed; for example, they can be viewed via the user terminals 120-1, 120-M to 120-N shown in Figure 1. For example, based on the characteristics of the business of interest, users can select the association rules they need from all the determined association rules, or view only the association rules related to a few entities of particular interest. For example, in the example shown in Figure 3, if the business of interest is particularly concerned with the conversion rate of advertising elements, then all association rules containing the entities "element" and "product" can be selected for viewing. As another example, still referring to Figure 3, if the business of interest is particularly concerned with fluctuations in user satisfaction caused by product issues, then all association rules containing the entities "evaluation label" and "product" can be selected for viewing from the determined association rules. In this disclosure, the desired association rules can also be selected based on the confidence and support of each determined association rule.

[0054] Figure 4 shows a flowchart of a method 400 for determining one or more association rules for a business of interest based on multiple entity sequences, according to an embodiment of the present disclosure. Method 400 may be performed by, for example, the computing device 110 shown in Figure 1, and may also be performed at the electronic device 500 shown in Figure 5. It should be understood that method 400 may also include additional blocks not shown and / or may omit blocks that are shown, and the scope of this disclosure is not limited in this respect.

[0055] In some embodiments, association analysis is performed for non-attribution problems, in which order does not matter, so method 400 may include steps 402-406. In other embodiments, attribution analysis is additionally required for attribution problems, so method 400 may further include steps 408-410.

[0056] In step 402, entities belonging to transitional entities are filtered out from the entity sequence (e.g., each entity sequence obtained in step 204) to obtain a filtered entity sequence.

[0057] As mentioned earlier, transitional entities are not objects of analysis for the business of interest, but rather serve to link other entities. During random walk sampling, transitional entities are typically those located between two analytical entities on the walk sampling path. Therefore, when determining the association rules for the business of interest, these transitional entities, which do not need to be analyzed, can be filtered out from the previously determined entity sequences, which helps to determine the relevant association rules more efficiently and accurately. Taking Figure 3 as an example, if the entity sequences “Product 1 → Transaction 1 → Evaluation Label 1” and “Element 1 → Advertisement 1 → User 1 → Transaction 1 → Product 1” are obtained in step 204, then in this step the transitional entity Transaction 1 is filtered out of the first sequence, and the transitional entities Advertisement 1, User 1, and Transaction 1 are filtered out of the second sequence, yielding the filtered entity sequences “Product 1 → Evaluation Label 1” and “Element 1 → Product 1”.
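The filtering in step 402 can be sketched as below. The set of transitional entity types and the "<Type> <number>" naming convention used to recover an entity's type are assumptions made for this example only; a real system would presumably look the type up from the graph data.

```python
# Assumed transitional entity types for the Figure 3 example.
TRANSITIONAL_TYPES = {"Transaction", "Advertisement", "User"}

def entity_type(entity):
    """Hypothetical convention: 'Evaluation Label 1' has type 'Evaluation Label'."""
    return entity.rsplit(" ", 1)[0]

def filter_transitional(sequence):
    """Drop transitional entities while preserving the walk order of the rest."""
    return [e for e in sequence if entity_type(e) not in TRANSITIONAL_TYPES]

sequences = [
    ["Product 1", "Transaction 1", "Evaluation Label 1"],
    ["Element 1", "Advertisement 1", "User 1", "Transaction 1", "Product 1"],
]
filtered = [filter_transitional(s) for s in sequences]
# filtered == [["Product 1", "Evaluation Label 1"], ["Element 1", "Product 1"]]
```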

[0058] In step 404, multiple frequent itemsets are determined based on the set of all filtered entity sequences. These frequent itemsets have a support greater than or equal to a predetermined minimum support. In this disclosure, the minimum support can be selected according to the specific application.

[0059] In some embodiments, since the filtered entity sequences are typically ordered lists, in step 404 these filtered entity sequences can first be degenerated into unordered lists, and the multiple frequent itemsets can then be determined based on the set of these unordered lists. That is, determining multiple frequent itemsets based on the set of all filtered entity sequences can refer to determining multiple frequent itemsets based on the set of corresponding unordered lists.

[0060] For example, in the previous example, the filtered entity sequences "Product 1 → Evaluation Label 1" and "Element 1 → Product 1" can be degenerated into the corresponding unordered lists {Product 1, Evaluation Label 1} and {Element 1, Product 1}, respectively. Therefore, in this example, the frequent itemsets can be determined based on the set consisting of the unordered lists {Product 1, Evaluation Label 1} and {Element 1, Product 1}. Specifically, the Apriori algorithm or the FP-tree algorithm can be used to determine the frequent itemsets.

[0061] In this disclosure, a set containing zero or more entities may be called an itemset. An itemset containing one entity may be called a 1-itemset, an itemset containing two entities may be called a 2-itemset, and so on. A frequent itemset is an itemset whose support is greater than or equal to a predetermined minimum support. Therefore, a frequent 1-itemset is an itemset containing one entity whose support is greater than or equal to the predetermined minimum support, a frequent 2-itemset is an itemset containing two entities whose support is greater than or equal to the predetermined minimum support, and so on.

[0062] In the current embodiment, the support of an itemset is a percentage, namely the ratio between the number of occurrences of the itemset (i.e., the number of filtered entity sequences, or corresponding unordered lists, in which the itemset appears) and the total number of filtered entity sequences (i.e., corresponding unordered lists). For example, in the previous example, there are two filtered entity sequences. Product 1 appears in both sequences, so the support of the 1-itemset including Product 1 is 2 / 2 = 100%. Evaluation Label 1 appears in only one of the entity sequences, so the support of the 1-itemset including Evaluation Label 1 is 1 / 2 = 50%. Element 1 appears in only one of the entity sequences, so the support of the 1-itemset including Element 1 is also 50%. Similarly, the support of the 2-itemset including Product 1 and Evaluation Label 1 is 50%, the support of the 2-itemset including Element 1 and Product 1 is also 50%, and so on.
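The support calculation above can be reproduced with a short sketch. The function name and the use of frozensets to represent the unordered lists are assumptions made for this example.

```python
def support(itemset, unordered_lists):
    """Fraction of unordered lists that contain every entity in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for entities in unordered_lists if itemset <= entities)
    return hits / len(unordered_lists)

# The two filtered entity sequences from the example, degenerated into sets.
unordered_lists = [
    frozenset({"Product 1", "Evaluation Label 1"}),
    frozenset({"Element 1", "Product 1"}),
]

s_product = support({"Product 1"}, unordered_lists)                    # 1.0 (100%)
s_label = support({"Evaluation Label 1"}, unordered_lists)             # 0.5 (50%)
s_pair = support({"Element 1", "Product 1"}, unordered_lists)          # 0.5 (50%)
s_never = support({"Element 1", "Evaluation Label 1"}, unordered_lists)  # 0.0
```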

[0063] In the Apriori algorithm, frequent itemsets can be classified into frequent 1-itemsets, frequent 2-itemsets, frequent 3-itemsets, etc., based on the number of entities they contain. The frequent 1-itemsets can be determined first, then the frequent 2-itemsets can be determined based on the frequent 1-itemsets, then the frequent 3-itemsets can be determined based on the frequent 2-itemsets, and so on, until no frequent itemsets of the next size are found (i.e., the resulting set of frequent itemsets is empty).
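The level-wise procedure can be sketched as follows. This is a compact, unoptimized rendering of the textbook Apriori join and prune steps, offered as an illustration under the assumption that each unordered list is given as a collection of entity names; it is not taken from the disclosure.

```python
from itertools import combinations

def apriori(unordered_lists, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in unordered_lists]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent 1-itemsets.
    level = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent, k = {}, 1
    while level:  # stop once a level yields no frequent itemsets
        frequent.update(level)
        k += 1
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        pruned = {c for c in joined
                  if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = keep_frequent(pruned)
    return frequent

frequent = apriori(
    [{"Product 1", "Evaluation Label 1"}, {"Element 1", "Product 1"}],
    min_support=0.5,
)
```

For the two-sequence example with a minimum support of 50%, this yields three frequent 1-itemsets and two frequent 2-itemsets; the candidate 3-itemset is pruned because {Element 1, Evaluation Label 1} is not frequent.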

[0064] In the FP-tree algorithm, the information in the entity sequences can be compressed by constructing an FP-tree, so that frequent itemsets are generated more efficiently. An FP-tree is essentially a prefix tree into which the items of each sequence are inserted in descending order of support. Frequent items with higher support are therefore closer to the root node, allowing more frequent items to share prefixes. The root node of an FP-tree is null and does not represent any item. The tree constructed in this way can be used to determine the final frequent itemsets. This can be implemented using the standard FP-tree algorithm and is therefore not elaborated further herein.
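As a rough illustration of the tree construction only (the subsequent mining of frequent itemsets from the tree is omitted), the sketch below builds an FP-tree from unordered lists. The class and function names, the absolute `min_count` threshold, and the alphabetical tie-break between items of equal support are assumptions of this example.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # None for the root node
        self.count = 0          # number of lists sharing this prefix
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(unordered_lists, min_count):
    """Insert each list into a prefix tree, most frequent items first."""
    freq = Counter(i for t in unordered_lists for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    for t in unordered_lists:
        # Keep frequent items only; order by descending count, then by name.
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root

root = build_fp_tree(
    [["Product 1", "Evaluation Label 1"], ["Element 1", "Product 1"]],
    min_count=1,
)
```

Because Product 1 has the highest support, both lists share the single prefix node "Product 1" directly under the root, with "Element 1" and "Evaluation Label 1" as its children.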

[0065] In step 406, one or more association rules are mined from these frequent itemsets (i.e., the multiple frequent itemsets determined in step 404), wherein the confidence of each mined association rule is greater than a predetermined minimum confidence. In this disclosure, since the support of each frequent itemset is greater than or equal to the minimum support, and since the association rules are mined from the frequent itemsets, the support of each association rule is also greater than or equal to the minimum support.

[0066] For example, each non-empty subset of each frequent itemset can be determined first. Then, for each non-empty subset, the confidence level of the non-empty subset can be obtained by determining the ratio (i.e., percentage) between the support of the frequent itemset and the support of the non-empty subset. If the confidence level of the non-empty subset is greater than a predetermined minimum confidence level, then the association rule determined based on the non-empty subset is one of the association rules to be mined. In this disclosure, the association rule determined based on the non-empty subset refers to an association rule with the non-empty subset as the antecedent and the remaining entities in the corresponding frequent itemset as the consequent. For example, in the association rule "Element 1 → Product 1", Element 1 is the antecedent of the association rule, and Product 1 is the consequent of the association rule.
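The rule-mining step can be sketched as follows. It assumes the frequent itemsets and their supports are already available as a dictionary, considers only non-empty proper subsets as antecedents (so that the consequent is never empty), and reports the support of the frequent itemset as the support of the rule; these choices and the function name are assumptions of this example.

```python
from itertools import combinations

def mine_rules(frequent, min_confidence):
    """frequent: {frozenset: support}. Returns (antecedent, consequent, confidence, support)."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):  # non-empty proper subsets only
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                # confidence = support(itemset) / support(antecedent)
                confidence = sup / frequent[antecedent]
                if confidence > min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence, sup))
    return rules

# Frequent itemsets of the running example (minimum support 50%).
frequent = {
    frozenset({"Product 1"}): 1.0,
    frozenset({"Evaluation Label 1"}): 0.5,
    frozenset({"Element 1"}): 0.5,
    frozenset({"Product 1", "Evaluation Label 1"}): 0.5,
    frozenset({"Element 1", "Product 1"}): 0.5,
}
rules = mine_rules(frequent, min_confidence=0.6)
```

With a minimum confidence of 60%, only "Element 1 → Product 1" and "Evaluation Label 1 → Product 1" survive (confidence 100% each); the reverse rules have confidence 50% and are discarded.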

[0067] In step 408, it is determined whether attribution analysis is needed to analyze the relationships between related entities.

[0068] For example, if the business of interest frequently experiences various problems, it is necessary to analyze the underlying causes of these problems. In this case, attribution analysis of the association relationships between the relevant entities can be pre-set as required. If such attribution analysis has been pre-set, then in step 408 it can be determined that attribution analysis is required.

[0069] In step 410, in response to determining that such attribution analysis is required, association rules whose causal order does not meet the requirements are filtered out from one or more of the mined association rules.

[0070] For example, if a user places an order for product 1 after viewing an advertisement containing element 1, and if two association rules are established based on the two sequences mentioned earlier—"Element 1 → Product 1, First Confidence, First Support" and "Product 1 → Element 1, Second Confidence, Second Support"—then "Element 1 → Product 1, First Confidence, First Support" follows a causal order, while "Product 1 → Element 1, Second Confidence, Second Support" does not, because the timestamp of the edge associated with element 1 is earlier than the timestamp of the edge associated with product 1. Therefore, the association rule "Product 1 → Element 1, Second Confidence, Second Support" is filtered out, and only the association rule "Element 1 → Product 1, First Confidence, First Support" is retained. Based on this association rule, the probability that the user will purchase product 1 after viewing an advertisement containing element 1 can be determined as the first confidence level.
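One possible way to express the causal-order filter is sketched below. The disclosure does not spell out how timestamps are attached to the entities of a rule, so the `event_time` mapping (entity to the timestamp of its associated edge), the numeric timestamps, and the "every antecedent precedes every consequent" test are all assumptions of this example.

```python
def follows_causal_order(antecedent, consequent, event_time):
    """True if every antecedent entity's event precedes every consequent entity's event."""
    latest_cause = max(event_time[e] for e in antecedent)
    earliest_effect = min(event_time[e] for e in consequent)
    return latest_cause < earliest_effect

# Hypothetical timestamps: the ad element was viewed before the product was ordered.
event_time = {"Element 1": 100, "Product 1": 250}

rules = [
    (("Element 1",), ("Product 1",)),
    (("Product 1",), ("Element 1",)),
]
kept = [r for r in rules if follows_causal_order(r[0], r[1], event_time)]
# kept == [(("Element 1",), ("Product 1",))]
```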

[0071] By employing the above methods, this disclosure can effectively improve the efficiency and accuracy of determining association rules for the business of interest.

[0072] Furthermore, in this disclosure, when attribution analysis is pre-set as required, random walk sampling can be performed on the graph data based at least on the prior condition that the walk sampling path must traverse multiple edges in chronological order. Entity sequences that do not satisfy this condition can thus be excluded during sampling, thereby improving the efficiency of such attribution analysis.
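A chronologically constrained walk could look roughly like the sketch below. It assumes a simplified directed adjacency list in which each edge carries a numeric timestamp, selects neighbors with equal probability, and discards the whole path as soon as an edge is not later than the previous one; the data layout and names are illustrative assumptions.

```python
import random

def chronological_walk(graph, start, max_steps, rng=random):
    """Return the walk path, or None if an edge violates chronological order."""
    path, current, previous_ts = [start], start, None
    for _ in range(max_steps):
        edges = graph.get(current, [])
        if not edges:
            break  # dead end: the walk stops early
        next_node, ts = rng.choice(edges)  # equal-probability selection
        if previous_ts is not None and ts <= previous_ts:
            return None  # exclude the path traversed in this round
        path.append(next_node)
        current, previous_ts = next_node, ts
    return path

# Hypothetical graph: node -> [(neighbor, timestamp of the connecting edge)].
ordered = {
    "Element 1": [("Advertisement 1", 1)],
    "Advertisement 1": [("User 1", 2)],
    "User 1": [("Transaction 1", 3)],
    "Transaction 1": [("Product 1", 4)],
}
unordered = {"A": [("B", 5)], "B": [("C", 3)]}

good_path = chronological_walk(ordered, "Element 1", max_steps=4)
bad_path = chronological_walk(unordered, "A", max_steps=2)  # None: 3 is not later than 5
```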

[0073] Figure 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 shown in Figure 1 can be implemented by the electronic device 500. As shown, the electronic device 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) 502 or loaded from storage unit 508 into random access memory (RAM) 503. The random access memory 503 can also store various programs and data required for the operation of the electronic device 500. The CPU 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output (I / O) interface 505 is also connected to the bus 504.

[0074] Multiple components in electronic device 500 are connected to input / output interface 505, including: input unit 506, such as keyboard, mouse, microphone, etc.; output unit 507, such as various types of monitors, speakers, etc.; storage unit 508, such as disk, optical disk, etc.; and communication unit 509, such as network card, modem, wireless transceiver, etc. Communication unit 509 allows device 500 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0075] The various processes and handling described above, such as methods 200 and 400, can be executed by the central processing unit 501. For example, in some embodiments, methods 200 and 400 may be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and / or installed on device 500 via read-only memory 502 and / or communication unit 509. When the computer program is loaded into random access memory 503 and executed by the central processing unit 501, one or more actions of methods 200 and 400 described above can be performed.

[0076] This disclosure relates to methods, apparatus, systems, electronic devices, computer-readable storage media, and / or computer program products. A computer program product may include computer-readable program instructions for performing various aspects of this disclosure.

[0077] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof. The computer-readable storage media used herein are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0078] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge computing devices. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to computer-readable storage media within the respective computing / processing device.

[0079] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.

[0080] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0081] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0082] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0083] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0084] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for performing correlation analysis for a business of interest, comprising: constructing graph data based on multiple relationships between multiple entities related to the business of interest, wherein the graph data includes multiple nodes and multiple edges, the multiple nodes correspond to the multiple entities, and the multiple edges correspond to the multiple relationships, and wherein the multiple entities are divided into two categories, analytical entities and transitional entities, the analytical entities being the entities among the multiple entities whose association relationships need to be determined, and the remaining entities among the multiple entities other than the analytical entities being transitional entities; performing random walk sampling on the graph data based on one or more prior conditions used to constrain the walk sampling path to obtain multiple entity sequences; and determining one or more association rules for the business of interest based on the multiple entity sequences, the association rules representing the association relationships between related entities; wherein determining one or more association rules for the business of interest based on the multiple entity sequences includes: filtering out entities belonging to transitional entities from the entity sequences to obtain filtered entity sequences; determining multiple frequent itemsets based on the set of all filtered entity sequences, wherein the support of the frequent itemsets is greater than or equal to a predetermined minimum support, and wherein determining the multiple frequent itemsets based on the set of all filtered entity sequences includes: degenerating all the filtered entity sequences into corresponding unordered lists; and determining the multiple frequent itemsets based on the set of the corresponding unordered lists; and mining one or more association rules from the frequent itemsets, wherein the confidence of the association rules is greater than a predetermined minimum confidence.

2. The method of claim 1, wherein determining one or more association rules for the business of interest based on the multiple entity sequences further comprises: determining whether attribution analysis needs to be performed on the association relationships between the related entities; and in response to determining that the attribution analysis needs to be performed, filtering out, from the mined one or more association rules, association rules whose causal order does not meet the requirements.

3. The method of claim 1, wherein the one or more prior conditions include: excluding walk sampling paths with spurious associations, where a spurious association refers to a walk sampling path traversing an entity sequence that includes more than a first threshold number of different transitional entities of the same entity type; and wherein performing random walk sampling on the graph data based on one or more prior conditions used to constrain the walk sampling path includes: for the current step of the current round of random walk sampling on the graph data, selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node; in response to determining that the next sampling node belongs to a transitional entity of a first entity type, incrementing a first count associated with the transitional entity of the first entity type; determining whether the first count is greater than or equal to the first threshold number; and in response to determining that the first count is greater than or equal to the first threshold number, excluding the walk sampling path traversed by the current round of random walk sampling.

4. The method of claim 1, wherein the one or more prior conditions include: the walk sampling path must traverse at least a second threshold number of entities of a specific entity type; and wherein performing random walk sampling on the graph data based on one or more prior conditions used to constrain the walk sampling path includes: for the current step of the current round of random walk sampling on the graph data, selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node; in response to determining that the next sampling node belongs to the specific entity type, incrementing a third count; at the end of the current round of random walk sampling, determining whether the third count is less than the second threshold number; and in response to determining that the third count is less than the second threshold number, excluding the walk sampling path traversed by the current round of random walk sampling.

5. The method according to claim 1, wherein the relationship includes an event relationship, and the edges of the graph data associated with the event relationship include corresponding timestamps, the timestamps indicating the time when the corresponding event relationship was formed.

6. The method of claim 5, wherein the one or more prior conditions include: the walk sampling path must traverse multiple edges in chronological order; and wherein performing random walk sampling on the graph data based on one or more prior conditions used to constrain the walk sampling path includes: for the current step of the current round of random walk sampling on the graph data, selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node; comparing a first timestamp with a second timestamp to determine whether the first timestamp is later than the second timestamp, wherein the first timestamp is the timestamp included in the edge between the next sampling node and the current sampling node, and the second timestamp is the timestamp included in the edge between the current sampling node and the previous sampling node; and in response to determining that the first timestamp is not later than the second timestamp, excluding the walk sampling path traversed by the current round of random walk sampling.

7. The method according to any one of claims 3, 4, and 6, wherein selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node comprises: randomly selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node, wherein each of the multiple neighbor nodes has the same probability of being selected as the next sampling node.

8. The method according to any one of claims 3, 4, and 6, wherein the multiple nodes included in the graph data include weights, and selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node comprises: randomly selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node, wherein the probability of each of the multiple neighbor nodes being selected as the next sampling node is positively correlated with the weight included in that neighbor node.

9. The method according to any one of claims 3, 4, and 6, wherein the multiple edges included in the graph data include weights, and selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node comprises: randomly selecting one neighbor node from the multiple neighbor nodes of the current sampling node as the next sampling node, wherein the probability of each of the multiple neighbor nodes being selected as the next sampling node is positively correlated with the weight of the edge between that neighbor node and the current sampling node.

10. A computing device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

11. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-9.