A visual intention understanding method and system based on an uncertainty cross-granularity evidence feature fusion network and a storage medium

By constructing an uncertainty cross-granularity evidence feature fusion network, the problem of insufficient utilization of granular-level information in visual intent understanding is solved, and more accurate and reliable visual intent recognition is achieved.

CN118196592B · Active · Publication Date: 2026-06-12 · SOUTHEAST UNIV +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTHEAST UNIV
Filing Date
2024-04-02
Publication Date
2026-06-12


Abstract

The application discloses a visual intention understanding method and system based on an uncertainty cross-granularity evidence feature fusion network and a storage medium, and comprises the following steps: constructing an uncertainty cross-granularity evidence feature fusion network, obtaining binary membership evidence pairs corresponding to each intention category under different granularities; aligning the cross-granularity hierarchical representation of the evidence pairs; constructing an evidence-guided uncertainty estimation network; fusing opinions from different granularities; inputting the training image into the cross-granularity evidence feature fusion network to obtain an intention understanding result, and then inputting the intention understanding result into a binary evidence loss function to train the uncertainty cross-granularity evidence feature fusion network; and inputting the test image into the trained cross-granularity evidence feature fusion network to obtain an understanding result of the human intention behind the image. The application also comprises a cross-granularity evidence alignment strategy based on a hierarchical relationship, which aligns the results under different granularities into a unified form; and an opinion combination rule based on uncertainty, which fuses opinions from different granularities. The application integrates the evidence theory into an uncertainty framework, utilizes the uncertainty to guide the cross-granularity fusion, enhances the representation ability of the network to the cross-granularity information, greatly reduces the influence of intention category ambiguity, improves the comprehensive understanding of visual content and the recognition ability of the human intention behind the visual content, and thus improves the ability of human-computer visual interaction.

Description

Technical Field

[0001] This invention relates to the field of computer vision understanding technology, and in particular to a visual intent understanding method, system, and storage medium based on an uncertain cross-granularity evidence feature fusion network.

Background Technology

[0002] Visual intent understanding is a multi-category psychology task aimed at understanding the inherent intentions behind images associated with human behavior. Visual content is a primary form of information delivery in social media, playing a crucial role in various applications such as psychological assessment, image captioning, and visual question answering. Compared to textual intent understanding, visual content is more comprehensive and difficult to process, involving various objects, backgrounds, and implicit relationships. Furthermore, due to the highly subjective nature of human intent, the mapping from visual features to intent categories is more complex, leading to blurred distinctions between different intents. Specifically, image content belonging to the same intent category is extremely diverse, rich, and even completely different; these features cannot be defined by specific shapes, objects, or scenes. Intent categories cannot be obtained by simply segmenting and identifying visual content in an image, but rather by the complex and underlying relationships between visual content. Moreover, even images containing the exact same object may belong to drastically different intent categories. For example, a sunflower blooming in the sun and withering in the rain convey opposite intents: "happiness" and "sadness."

[0003] Visual intent understanding has been applied across multiple fields, primarily advertising understanding, political propaganda understanding, and understanding the motivations behind human behavior. Advertising understanding emphasizes the combined influence of multimodal information such as images, videos, and advertising slogans; political propaganda understanding tends to focus on individual gestures and facial expressions; motivational understanding prioritizes human behavior. To understand the intent behind any image related to human behavior from a single modality alone, both low-level visual features and high-level semantic relationships within the image must be considered. Furthermore, the relationships between visual features and intent categories are complex and go beyond the one-to-one correspondence found in traditional recognition tasks. These challenges lead to significant ambiguity among intent categories. To improve the discriminative power between intent categories, recent research has extracted multi-granular features from hierarchical trees of intent categories, establishing hierarchical constraints.

[0004] While existing hierarchical methods have achieved good performance, they ignore the varying degrees of category ambiguity at different granularity levels. For example, many methods integrate multi-granularity information only at the output layer. Furthermore, some methods utilize hierarchical constraints at the feature layer to transform fine-grained outputs into coarse-grained inputs. These methods overlook the differences in the degree of category ambiguity across granularity levels and assume that cross-granularity results can be compared and integrated on equal terms. To make effective use of the information at different granularities and of the constraints between intent categories at each level, a method is needed that reveals the different degrees of ambiguity between intent categories at different granularities and, based on this, performs cross-granularity fusion.

Summary of the Invention

[0005] Purpose of the invention: The purpose of this invention is to provide a visual intent understanding method, system, and storage medium based on an uncertain cross-granularity evidence feature fusion network, which can improve the comprehensive understanding of the visual content of images and the recognition of the intent behind them.

[0006] Technical Solution: To achieve the above objectives, the present invention provides a visual intent understanding method based on an uncertain cross-granularity evidence feature fusion network, comprising the following steps:

[0007] Step 1: Construct an uncertainty cross-granularity evidence feature fusion network, including an image feature extraction network, an evidence generation network, and an evidence-guided uncertainty estimation network;

[0008] Step 2: Obtain binary membership evidence pairs corresponding to each intent category at different granularities in the training images based on image feature extraction networks and evidence generation networks;

[0009] Step 3: Align the cross-granularity hierarchical representation of binary membership evidence pairs;

[0010] Step 4: Utilize an evidence-guided uncertainty estimation network to allocate confidence and overall uncertainty to binary membership evidence pairs corresponding to each intent category at different granularities, in order to generate subjective opinions for each intent category at different granularities;

[0011] Step 5: Integrate subjective opinions from different granularities to obtain intent understanding of training images at different granularities;

[0012] Step 6: Feed the intent understanding results obtained from the training images at different granularities into the binary evidence loss function, and use the binary evidence loss function to supervise the training of the cross-granularity evidence feature fusion network in Step 1 to optimize the network parameters;

[0013] Step 7: Input the test image into the trained cross-granularity evidence feature fusion network to obtain the visual intent understanding of the image, thereby obtaining the true intent behind the image.

[0014] The image feature extraction network in step 1 includes a sample-level image feature extractor, wherein the sample level refers to each sample corresponding to only a single feature map, and the feature extractor is a pre-trained convolutional neural network used to extract shallow visual feature maps of the training images.

[0015] The evidence generation network includes evidence generation networks at various granularity levels. Its input is a shallow visual feature map, its output dimension is K×2, and the last layer is a non-negative activation layer. The granularity refers to different levels of analyzing training images, including coarse, medium, and fine granularity levels.

[0016] In step 2, an image feature extraction network and an evidence generation network are used to obtain binary membership evidence pairs corresponding to each intent category at different granularities in the training image. Specifically, the image feature extraction network is used to extract shallow visual feature maps of the training image. Based on the shallow visual feature maps, the evidence generation network generates binary membership evidence pairs corresponding to each intent category at different granularity levels. The binary membership evidence pairs represent the feature information of the training image related to the intent category at different granularities, and each intent category represents a different meaning or goal expressed by the training image.

[0017] Wherein, the binary membership evidence pairs mentioned in step 2 are represented as $E = \{e_1, e_2, \ldots, e_k, \ldots, e_K\}$ with $e_k = (e_k^-, e_k^+)$. Each binary membership evidence pair includes evidence supporting non-membership in the category and evidence supporting membership in the category; $K$ is the total number of intent categories, and $k$ indexes an intent category, $k = 1, 2, \ldots, K$; $e_k^-$ is the evidence that the image does not belong to category $k$, and $e_k^+$ is the evidence that it belongs to category $k$.

[0018] In step 3, the cross-granularity hierarchical representations of the evidence pairs are aligned because the number of intent categories differs across granularities, so the dimensions of the binary membership classification results also differ. The classification results of binary membership evidence pairs at different granularities are therefore aligned into a unified form.

[0019] The alignment method includes a cross-granularity evidence alignment strategy based on hierarchical relationships, wherein the hierarchical relationships include the constraint relationship between coarse-grained categories and fine-grained categories, and the complementary relationship between fine-grained categories and coarse-grained categories.

[0020] The cross-granularity evidence alignment includes: aligning fine-grained evidence pairs to the medium-grained form, and aligning medium-grained evidence pairs to the coarse-grained form. The cross-granularity evidence alignment strategy groups the subclasses that belong to the same coarse-grained or medium-grained class and integrates the binary membership evidence pairs within each group, so that the result matches the form of the coarser-grained level. The integration method is as follows:

[0021]

[0022]

[0023] Where $k_i$ denotes a subclass of intent category $k$ at granularity level $m$, $e_k^-$ denotes the evidence that the image does not belong to category $k$, and $e_k^+$ denotes the evidence that it belongs to category $k$.

[0024] In step 4, confidence and overall uncertainty are allocated as follows: the evidence-guided uncertainty estimation network assigns confidence and uncertainty to each intent category by combining a Dirichlet distribution with the binary membership evidence pairs. The combination links the Dirichlet parameters $\alpha_k^i$ to the binary membership evidence $e_k^i$ as follows:

[0025] $\alpha_k^i = e_k^i + 1, \quad i \in \{-, +\}$

[0026] Based on the Dirichlet distribution, intent category $k$ is assigned belief masses $b_k^-$ and $b_k^+$ for its binary membership evidence pair, together with an overall uncertainty $u_k$;

[0027] Based on the belief masses and the overall uncertainty, the subjective opinion $O_k$ for intent category $k$ is:

[0028] $O_k = \{b_k^-, b_k^+, u_k\}, \quad b_k^i = \dfrac{e_k^i}{S_k}, \quad u_k = \dfrac{2}{S_k}$

[0029] where $S_k = \sum_{i \in \{-,+\}} \alpha_k^i$ is the Dirichlet strength, $i \in \{-, +\}$, and $b_k^i$ is the belief mass.

[0030] Among them, the opinions of different granularities mentioned in step 5 include opinions of coarse granularity, opinions of medium granularity, opinions of medium granularity integrated into coarse granularity, opinions of fine granularity, and opinions of fine granularity integrated into medium granularity.

[0031] The fusion of subjective opinions from different granularities includes the fusion of opinions of the same form but from different granularity layers. Specifically, it includes the fusion of opinions from the coarse-grained layer and opinions from the medium-grained layer integrated into a coarse-grained layer form, and the fusion of opinions from the medium-grained layer and opinions from the fine-grained layer integrated into a medium-grained layer form.

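[0032] The fusion method combines two independent subjective opinions $O_k^{(1)} = \{b_k^{-,(1)}, b_k^{+,(1)}, u_k^{(1)}\}$ and $O_k^{(2)} = \{b_k^{-,(2)}, b_k^{+,(2)}, u_k^{(2)}\}$ into a joint opinion $O_k^* = \{b_k^{-,*}, b_k^{+,*}, u_k^*\}$. The combination rules are as follows:

[0033] $O_k^* = O_k^{(1)} \oplus O_k^{(2)}$

[0034] $b_k^{-,*} = \dfrac{1}{1 - C}\left(b_k^{-,(1)} b_k^{-,(2)} + b_k^{-,(1)} u_k^{(2)} + b_k^{-,(2)} u_k^{(1)}\right)$

[0035] $b_k^{+,*} = \dfrac{1}{1 - C}\left(b_k^{+,(1)} b_k^{+,(2)} + b_k^{+,(1)} u_k^{(2)} + b_k^{+,(2)} u_k^{(1)}\right)$

[0036] $u_k^* = \dfrac{1}{1 - C}\, u_k^{(1)} u_k^{(2)}$

[0037] where $C = \sum_{i \neq j} b_k^{i,(1)} b_k^{j,(2)}$ is the amount of conflict between the two subjective opinions, $i, j \in \{-, +\}$; $\frac{1}{1 - C}$ is a normalization factor; $b_k^{-,*}$ and $b_k^{+,*}$ are the belief masses of the joint opinion, and $u_k^*$ is the uncertainty of the joint opinion;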

[0038] Fusion yields the intent understanding results of the training images at three granularities: coarse, medium, and fine. That is, the subjective opinions corresponding to each intent category at the three granularities are fused. Specifically, the intent understanding result at the coarse granularity is the joint opinion formed from the opinions of the coarse-grained layer and the opinions of the medium-grained layer integrated into the coarse-grained form. The intent understanding result at the medium granularity is the joint opinion formed from the opinions of the medium-grained layer and the opinions of the fine-grained layer integrated into the medium-grained form. The intent understanding result at the fine granularity consists of the opinions of the fine-grained layer.

[0039] The formula for the binary evidence loss function in step 6 is as follows:

[0040]

[0041] where $L_{BCE}(\alpha_k)$ is the binary cross-entropy loss modified for binary evidential deep learning, formulated as follows:

[0042]

[0043] where $y_k$ denotes the binary ground truth of the training image with respect to intent category $k$: $y_k = 1$ if the training image belongs to intent category $k$, and $y_k = 0$ if it does not. $L_{KL}(\alpha_k)$ is the regularization term, formulated as follows:

[0044]

[0045] where $K_m$ is the number of intent categories at granularity level $m$; $\lambda_t = \min(1.0, t/10) \in [0, 1]$ is the annealing coefficient, and $t$ is the current training epoch. During training, $\lambda_t$ gradually increases the weight of the KL divergence term, so that the network can explore the parameter space more freely in the initial stage of training.

[0046] The visual intent understanding system based on an uncertain cross-granularity evidence feature fusion network, as described in this invention, includes the following modules:

[0047] Network construction module: Constructs an uncertainty cross-granularity evidence feature fusion network, including an image feature extraction network, an evidence generation network, and an evidence-guided uncertainty estimation network;

[0048] Evidence Pair Generation and Alignment Module: Based on image feature extraction network and evidence generation network, it obtains binary membership evidence pairs corresponding to each intent category at different granularities in the training image; and aligns the cross-granularity hierarchical representation of binary membership evidence pairs.

[0049] Opinion generation module: Utilizes an evidence-guided uncertainty estimation network to classify the binary membership evidence pairs corresponding to each intent category at different granularities, and allocates confidence and overall uncertainty to generate subjective opinions for each intent category at different granularities;

[0050] Opinion Fusion Module: Fuses subjective opinions from different granularities to obtain intent understanding of training images at different granularities;

[0051] Network training module: Feeds the intent understanding results obtained from training images at different granularities into the binary evidence loss function, and uses the binary evidence loss function to supervise the training of the cross-granularity evidence feature fusion network and optimize the network parameters.

[0052] Image Intent Acquisition Module: Input the test image into the trained cross-granularity evidence feature fusion network to obtain the visual intent understanding of the image, thereby obtaining the true intent behind the image.

[0053] The present invention provides a computer-readable storage medium for storing one or more programs, wherein the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above for visual intent understanding.

[0054] Beneficial effects: The present invention has the following advantages: 1. The present invention obtains binary membership evidence pairs corresponding to each intention category at different granularities in the image based on image feature extraction network and evidence generation network. Then, by introducing uncertainty estimation network, it assigns confidence and overall uncertainty to each intention category of binary membership evidence pairs at different granularities to generate subjective opinions of each intention category at different granularities. This reveals the degree of ambiguity of intention categories at different granularities, strengthens the differences between the information contained at different granularities, and effectively ensures the reliability of the representation of intention categories at different granularities.

[0055] 2. This invention performs cross-granularity fusion on subjective opinions of the same form from different granularities to obtain the intent understanding of the image at different granularities. This process fully considers the similarity and difference between subjective opinions at different granularities, utilizes the complementarity of information contained in different granularities, significantly improves the efficiency of multi-granularity information in processing complex visual intent, enhances the accuracy and robustness of visual intent understanding, and ultimately achieves effective recognition of image visual intent.

Brief Description of the Drawings

[0056] Figure 1 is a schematic diagram of the method flow of the present invention;

[0057] Figure 2 is a schematic diagram of the cross-granularity evidence feature fusion network structure;

[0058] Figure 3 is a schematic diagram of the cross-granularity evidence alignment strategy based on hierarchical relationships;

[0059] Figure 4 is a schematic diagram of the uncertainty-based opinion combination rule.

Detailed Implementation

[0060] The technical solution of the present invention will be described in detail below with reference to the embodiments and accompanying drawings.

[0061] As shown in Figure 1, the visual intent understanding method based on an uncertain cross-granularity evidence feature fusion network described in this invention mainly includes the following processes:

[0062] First, construct an uncertainty cross-granularity evidence feature fusion network, including an image feature extraction network, an evidence generation network, and an evidence-guided uncertainty estimation network;

[0063] Second, the image feature extraction network and the evidence generation network are used to obtain binary membership evidence pairs corresponding to each intent category of the training image at three granularities; granularity refers to the level at which training images are analyzed and comprises three levels: coarse, medium, and fine.

[0064] Third, perform cross-granularity evidence alignment based on granularity hierarchy relationships;

[0065] Because the number of intent categories varies at different granularities, the dimensions of binary membership classification results also differ. Therefore, the classification results of binary membership evidence pairs at different granularity levels are aligned into a unified form. The cross-granularity evidence alignment strategy based on hierarchical relationships includes constraints between coarser-grained categories and finer-grained categories, as well as supplementary relationships between finer-grained categories and coarser-grained categories. The cross-granularity information integration strategy groups subclasses belonging to the same coarser-grained category and integrates binary membership evidence belonging to the same group, aligning the form of the coarser-grained results.

[0066] Fourth, the evidence-guided uncertainty estimation network is used to assign confidence and overall uncertainty to the intention category k of the binary membership evidence pair, and subjective opinions on intention category k are formed based on confidence and overall uncertainty.

[0067] Fifth, based on the rules for combining opinions under uncertainty, subjective opinions from different granularities are integrated;

[0068] The subjective opinions from different granularities include opinions from the coarse-grained layer, opinions from the medium-grained layer, opinions from the medium-grained layer integrated into the coarse-grained layer, opinions from the fine-grained layer, and opinions from the fine-grained layer integrated into the medium-grained layer.

[0069] The fusion of opinions from different granularities includes the fusion of opinions of the same form but from different granularity layers. The opinions of the same form but from different granularity layers include the fusion of opinions from a coarse-grained layer and opinions from a medium-grained layer integrated into a coarse-grained layer form, and the fusion of opinions from a medium-grained layer and opinions from a fine-grained layer integrated into a medium-grained layer form.

[0070] Sixth, based on the binary evidence loss function, the constructed cross-granularity evidence feature fusion network is trained under supervision using the binary evidence loss function to optimize the network parameters.

[0071] Seventh, the visual intent understanding results of the test image are obtained by using the trained cross-granularity evidence feature fusion network, thereby obtaining the true intent behind the image.

[0072] The method of the present invention will be described in detail below with reference to Example 1.

[0073] Example 1

[0074] (1) Obtain shallow visual features from training images

[0075] As shown in Figure 2, the image feature extraction network is a ResNet101 trained on the large-scale object recognition dataset ImageNet, used to coarsely extract shallow visual information from images.

[0076] Specifically, given an image $x \in \mathbb{R}^{H_0 \times W_0 \times 3}$, the feature extraction network obtains the feature map $X_0 \in \mathbb{R}^{H \times W \times d'}$ of the image using the following formula:

[0077] $X_0 = f(x)$,

[0078] where $H_0 \times W_0$ is the original resolution of the input image, and 3 is the original number of channels.

[0079] Through a linear projection layer, the feature map is projected into a reshaped feature embedding $X \in \mathbb{R}^{HW \times d}$. The feature embedding is the input of the evidence generation networks at the three granularities in (2).

[0080] (2) Obtain binary membership evidence pairs corresponding to each intent category at three granularities.

[0081] The three-level evidence generation network comprises three Transformers with different dimensions, corresponding to three granularity layers with 28, 15, and 9 categories, respectively. By constructing Transformer blocks, evidence supporting each category is obtained, and the output evidence serves as the basis for subsequent uncertainty estimation.

[0082] The Transformer block mainly consists of two parts:

[0083] 1) Set the output dimension of the Transformer to K×2, where K is the number of categories, to obtain a binary membership classification result for each category. Binary membership classification decides whether a sample belongs to a given category. The result for each category is a two-dimensional vector whose first element is the score against belonging to the category and whose second element is the score for belonging to it. That is, for each category the Transformer outputs a pair consisting of a score opposing membership and a score supporting membership.

[0084] 2) Add a non-negative activation layer (ReLU) at the end of the Transformer to obtain a non-negative evidence vector, specifically:

[0085]

[0086] That is, the output of the Transformer block is the evidence vector $E = \{e_1, e_2, \ldots, e_k, \ldots, e_K\}$. The evidence consists of indicators extracted from the input that support the classification. Here $e_k = (e_k^-, e_k^+)$ is the binary membership evidence pair for category $k$, where $e_k^-$ is the evidence of not belonging to the category and $e_k^+$ is the evidence of belonging to it.
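
The evidence head described in 1) and 2) can be sketched in a few lines of Python. This is a minimal illustration, not the patented network: it assumes the raw Transformer output for one image is already available as a flat list of 2K scores ordered as (against, for) per category, and the helper names `relu` and `to_evidence_pairs` are hypothetical.

```python
def relu(x):
    """Non-negative activation applied to a single raw score."""
    return x if x > 0.0 else 0.0


def to_evidence_pairs(raw_scores, num_classes):
    """Turn 2*K raw scores into K non-negative (e_minus, e_plus) pairs.

    Assumed layout (not stated in the source): scores for category k sit at
    positions 2k (against membership) and 2k + 1 (for membership).
    """
    if len(raw_scores) != 2 * num_classes:
        raise ValueError("expected 2 * K raw scores")
    return [
        (relu(raw_scores[2 * k]), relu(raw_scores[2 * k + 1]))
        for k in range(num_classes)
    ]


# Three categories, so six raw scores.
raw = [-0.3, 2.1, 1.4, -0.7, 0.0, 0.5]
pairs = to_evidence_pairs(raw, 3)
print(pairs)
```

Negative raw scores are clipped to zero, so a category can receive no evidence on either side, which later translates into high uncertainty.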

[0087] (3) Cross-granularity hierarchical representation of aligned evidence pairs

[0088] As shown in Figure 3, the input image feature embedding is the same for each granularity level, while the output dimension varies with the number of classes. There are three granularity layers with 28, 15, and 9 classes respectively, resulting in output evidence vector dimensions of 28×2, 15×2, and 9×2.

[0089] To integrate the different outputs at each granularity level, a novel cross-granularity evidence alignment strategy based on hierarchical relationships is proposed. This cross-granularity integration strategy integrates the output forms from finer-grained layers into output forms from coarser-grained layers. Based on the hierarchical relationship of multiple granular categories, subclasses belonging to the same coarse-grained class are grouped, and binary membership evidence belonging to the same group is integrated according to the following formula:

[0090]

[0091]

[0092] Where $k_i$ denotes a subclass of category $k$ at granularity level $m$.

[0093] The aforementioned evidence-based cross-granularity integration strategy guarantees two principles, including:

[0094] 1) As long as any subclass at granularity level m-1 obtains large membership evidence, the corresponding category at granularity level m also obtains large membership evidence.

[0095] 2) Only when all subclasses at granularity level m-1 obtain large non-membership evidence does the corresponding category at granularity level m obtain large non-membership evidence.

[0096] The principle described is consistent with hierarchical relationships in the real world, which specifically include:

[0097] 1) Samples belonging to a finer-grained subclass must belong to the corresponding coarser-grained category.

[0098] 2) Conversely, samples that do not belong to a coarser-grained category certainly do not belong to any finer-grained subclass.
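
The two alignment principles above can be illustrated with a short Python sketch. The integration formula itself is not reproduced in this text, so the sketch assumes one simple rule that is consistent with both principles: take the maximum over subclasses for membership evidence and the minimum over subclasses for non-membership evidence. The function name `align_to_coarser` and the `children` index lists are hypothetical.

```python
def align_to_coarser(fine_pairs, children):
    """Align finer-grained evidence pairs to the coarser-grained form.

    fine_pairs: list of (e_minus, e_plus), one per finer-grained category.
    children:   for each coarser category, the indices of its subclasses.

    Assumed rule (not confirmed by the source):
      e_plus  of the parent = max of e_plus  over its subclasses
      e_minus of the parent = min of e_minus over its subclasses
    """
    aligned = []
    for group in children:
        e_minus = min(fine_pairs[i][0] for i in group)
        e_plus = max(fine_pairs[i][1] for i in group)
        aligned.append((e_minus, e_plus))
    return aligned


# Four fine categories grouped under two coarser categories.
fine = [(0.2, 3.0), (2.5, 0.1), (4.0, 0.0), (3.5, 0.3)]
hierarchy = [[0, 1], [2, 3]]
coarse = align_to_coarser(fine, hierarchy)
print(coarse)
```

In the example, the first parent inherits strong membership evidence because one of its subclasses has it, while the second parent receives strong non-membership evidence only because both of its subclasses do.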

[0099] (4) Constructing an evidence-guided uncertainty estimation network

[0100] To address the issue of category ambiguity at different granularity levels, evidence-based uncertainty is introduced to reveal the degree of category ambiguity. Following evidential deep learning, a Dirichlet distribution is used to model the distribution of class probabilities. This Dirichlet distribution is parameterized by $K$ parameters $\alpha = [\alpha_1, \ldots, \alpha_K]$ and is given by the following formula:

[0101] $D(p \mid \alpha) = \begin{cases} \dfrac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}, & p \in S_K, \\ 0, & \text{otherwise,} \end{cases}$

[0102] where $S_K$ is the K-dimensional unit simplex, defined as $S_K = \{p \mid \sum_{k=1}^{K} p_k = 1,\ 0 \le p_k \le 1\}$, and $B(\alpha)$ is the K-dimensional multinomial beta function.

[0103] For each binary membership evidence pair, uncertainty estimation based on evidential deep learning is performed separately. Given the evidence vector $E = \{e_1, e_2, \ldots, e_k, \ldots, e_K\}$ from the Transformer, subjective logic and evidence theory connect the evidence pair $e_k = (e_k^-, e_k^+)$ with the Dirichlet parameters $\alpha_k = (\alpha_k^-, \alpha_k^+)$. Specifically, the relationship between $e_k^i$ and $\alpha_k^i$ is:

[0104] $\alpha_k^i = e_k^i + 1, \quad i \in \{-, +\}$

[0105] After obtaining the corresponding Dirichlet distribution at each granularity level, subjective logic uses it to assign belief masses $b_k^-$ and $b_k^+$ to the binary membership categories and an overall uncertainty $u_k$ to category $k$. Together, they form the subjective opinion $O_k$:

[0106] $O_k = \{b_k^-, b_k^+, u_k\}, \quad b_k^- + b_k^+ + u_k = 1,$

[0107] where $b_k^- \ge 0$, $b_k^+ \ge 0$, and $u_k \ge 0$.

[0108] The belief masses $b_k^i$ and the uncertainty mass $u_k$ are obtained from the following equation:

[0109] $b_k^i = \dfrac{e_k^i}{S_k} = \dfrac{\alpha_k^i - 1}{S_k}, \quad u_k = \dfrac{2}{S_k},$

[0110] where $S_k = \alpha_k^- + \alpha_k^+$ is known as the Dirichlet strength.

[0111] From the above equation, it follows that the more evidence is obtained opposing or supporting the k-th intent category, the larger the corresponding belief mass $b_k^-$ or $b_k^+$. Conversely, the less evidence the binary membership pair of the k-th category receives, the larger the overall uncertainty $u_k$.

[0112] Given a subjective opinion, the probability $p_k^i$ of a binary class is the mean of the corresponding Dirichlet distribution, specifically $p_k^i = \alpha_k^i / (\alpha_k^- + \alpha_k^+)$.
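
The mapping from an evidence pair to a subjective opinion can be sketched as follows. The sketch assumes the standard evidential deep learning relations for a two-outcome Dirichlet distribution: $\alpha = e + 1$, $S = \alpha^- + \alpha^+$, $b = e / S$, $u = 2 / S$, and expected probability $p = \alpha / S$. The function name and dictionary keys are illustrative only.

```python
def opinion_from_evidence(e_minus, e_plus):
    """Build a subjective opinion for one category from its evidence pair.

    Assumes alpha = evidence + 1 and a Dirichlet over two outcomes, so the
    uncertainty mass is 2 / S and belief masses are evidence / S.
    """
    alpha_minus = e_minus + 1.0
    alpha_plus = e_plus + 1.0
    strength = alpha_minus + alpha_plus  # Dirichlet strength S
    return {
        "b_minus": e_minus / strength,
        "b_plus": e_plus / strength,
        "u": 2.0 / strength,
        "p_minus": alpha_minus / strength,
        "p_plus": alpha_plus / strength,
    }


strong = opinion_from_evidence(1.0, 7.0)  # plenty of evidence, S = 10
weak = opinion_from_evidence(0.0, 0.0)    # no evidence at all, S = 2

for name, op in (("strong", strong), ("weak", weak)):
    total = op["b_minus"] + op["b_plus"] + op["u"]
    print(name, op, "masses sum to", total)
```

With no evidence the opinion is entirely uncertain ($u = 1$) and the expected probabilities fall back to 0.5 each, which is the behavior paragraph [0111] describes.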

[0113] (5) Integrating opinions from different granularities

[0114] As shown in Figure 4, the Dempster-Shafer (DS) evidence theory allows opinions from different sources to be combined into a comprehensive opinion that integrates all available evidence.

[0115] To address the binary membership classification problem, a novel uncertainty-based opinion combination rule is designed according to the DS evidence theory. Specifically, for intent category k (class k), the frame of discernment of the DS evidence theory can be expressed as:

[0116] $\Omega_k = \{F_k, T_k\}$,

[0117] where $F_k$ indicates that the sample does not belong to class $k$ and $T_k$ indicates that the sample belongs to class $k$. The power set of $\Omega_k$ consists of all its subsets and contains the following four elements:

[0118] $P(\Omega_k) = \{\varnothing, \{F_k\}, \{T_k\}, \{F_k, T_k\}\}$,

[0119] For $A \in P(\Omega_k)$, if $P(A) > 0$, $A$ is called a focal element.

[0120] The probability of $\{F_k\}$ or $\{T_k\}$ can be obtained directly from the belief mass assigned to not belonging to or belonging to class $k$. Furthermore, $\{F_k, T_k\}$ is the set containing two contradictory events, and its probability can be represented by the overall uncertainty. The set of focal elements $F(\Omega_k)$ is defined as follows:

[0121] $F(\Omega_k) = \{\{F_k\}, \{T_k\}, \{F_k, T_k\}\}$,
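
The frame of discernment, its power set, and the focal elements can be made concrete with a small Python sketch. It follows the assignment described above (belief masses on the singletons, overall uncertainty on the full set); the string labels "F" and "T" and all function names are placeholders chosen for illustration.

```python
from itertools import combinations


def power_set(frame):
    """All subsets of the frame of discernment, as frozensets."""
    items = sorted(frame)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def mass_from_opinion(b_minus, b_plus, u):
    """Place belief masses on {F} and {T}, and uncertainty on {F, T}."""
    return {
        frozenset(): 0.0,
        frozenset({"F"}): b_minus,
        frozenset({"T"}): b_plus,
        frozenset({"F", "T"}): u,
    }


def focal_elements(mass):
    """Subsets that carry strictly positive mass."""
    return [subset for subset, value in mass.items() if value > 0]


subsets = power_set({"F", "T"})
mass = mass_from_opinion(0.1, 0.7, 0.2)
focal = focal_elements(mass)
print(len(subsets), "subsets;", len(focal), "focal elements")
```

The empty set never carries mass, so at most three of the four subsets can be focal elements.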

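[0122] Specifically, for class $k$, two independent opinions $O_k^{(1)} = \{b_k^{-,(1)}, b_k^{+,(1)}, u_k^{(1)}\}$ and $O_k^{(2)} = \{b_k^{-,(2)}, b_k^{+,(2)}, u_k^{(2)}\}$ can be combined into a joint opinion $O_k^* = \{b_k^{-,*}, b_k^{+,*}, u_k^*\}$ as shown below:

[0123] $O_k^* = O_k^{(1)} \oplus O_k^{(2)}$

[0124] $b_k^{-,*} = \dfrac{1}{1 - C}\left(b_k^{-,(1)} b_k^{-,(2)} + b_k^{-,(1)} u_k^{(2)} + b_k^{-,(2)} u_k^{(1)}\right)$

[0125] $b_k^{+,*} = \dfrac{1}{1 - C}\left(b_k^{+,(1)} b_k^{+,(2)} + b_k^{+,(1)} u_k^{(2)} + b_k^{+,(2)} u_k^{(1)}\right)$

[0126] $u_k^* = \dfrac{1}{1 - C}\, u_k^{(1)} u_k^{(2)}$

[0127] where $C = b_k^{-,(1)} b_k^{+,(2)} + b_k^{+,(1)} b_k^{-,(2)}$ is the amount of conflict between the two opinions, and $\frac{1}{1 - C}$ is a normalization factor, which essentially makes $b_k^{-,*} + b_k^{+,*} + u_k^* = 1$.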

[0128] The uncertainty-based opinion combination rule satisfies four principles:

[0129] 1) When the uncertainty of both opinions is low ($u^{(1)}$ and $u^{(2)}$ are both small), the confidence of the joint opinion is relatively high ($u^{*}$ is small).

[0130] 2) When the uncertainty of both opinions is high ($u^{(1)}$ and $u^{(2)}$ are both large), the confidence of the joint opinion is low ($b^{-,*}$ and $b^{+,*}$ are both small and $u^{*}$ is large).

[0131] 3) When only one opinion is confident (only $u^{(1)}$ or only $u^{(2)}$ is small), the joint opinion depends more on the confident opinion, that is, on the subjective opinion with low uncertainty u.

[0132] 4) When the two opinions conflict, both C and $u^{*}$ increase accordingly.
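The four principles can be checked numerically with a small sketch. The patent's own combination formulas are not available in this text, so the code below substitutes the classical Dempster rule in its reduced form for binary opinions; it is an assumption, and the patent's rule may differ. The function name and the tuple layout `(b_neg, b_pos, u)` are hypothetical.

```python
def combine_opinions(op1, op2):
    """Combine two binary opinions (b_neg, b_pos, u) with classical Dempster's rule.

    Stand-in for the patent's rule, which is not reproduced in the source.
    Returns the joint opinion and the conflict C.
    """
    b_neg1, b_pos1, u1 = op1
    b_neg2, b_pos2, u2 = op2
    # Conflict: one opinion backs "does not belong" while the other backs "belongs".
    conflict = b_neg1 * b_pos2 + b_pos1 * b_neg2
    if conflict >= 1.0:
        raise ValueError("totally conflicting opinions cannot be combined")
    scale = 1.0 / (1.0 - conflict)  # normalization so the result sums to 1
    b_neg = scale * (b_neg1 * b_neg2 + b_neg1 * u2 + b_neg2 * u1)
    b_pos = scale * (b_pos1 * b_pos2 + b_pos1 * u2 + b_pos2 * u1)
    u = scale * u1 * u2
    return (b_neg, b_pos, u), conflict


if __name__ == "__main__":
    confident = (0.0, 0.9, 0.1)
    vague = (0.05, 0.05, 0.9)
    joint, c = combine_opinions(confident, vague)
    print("joint opinion:", joint, "conflict:", c)
```

In this stand-in rule the joint uncertainty is proportional to the product of the two input uncertainties, which is why a single confident opinion is enough to dominate the result.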

[0133] (6) The intent understanding results obtained from the training images at three granularities are fed into the binary evidence loss function to train the entire network.

[0134] The intent understanding results obtained at the three granularities are the fused subjective opinions corresponding to each category at the three granularities, where $b_k^{-}$ is the score for not belonging to the intention category, $b_k^{+}$ is the score for belonging to the intention category, and $u_k$ is the overall uncertainty of the classification. The fused joint opinion is fed into a loss function designed for binary evidence deep learning. Starting from the binary cross-entropy loss $L_{bce} = -\left(y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i)\right)$, a simple modification gives the binary cross-entropy loss for binary evidence deep learning:

[0135]

[0136] where $\psi(\cdot)$ is the digamma function. The $L_{bce}$ term ensures that each sample receives more evidence for the correct category, but it does not guarantee less evidence for the incorrect category. Therefore, a KL divergence term is introduced into the loss:

[0137]

[0138] where the adjusted parameters are the Dirichlet distribution parameters obtained from the predicted parameters $\alpha_k$ after filtering out misleading evidence, and $\Gamma(\cdot)$ is the gamma function. Therefore, the loss with this regularization term at granularity m can be expressed as:

[0139]

[0140] where $K_m$ is the number of categories at granularity level m, $\lambda_t = \min(1.0, t/10) \in [0, 1]$ is the decay coefficient, and t is the current training round. $\lambda_t$ increases gradually, so that the KL divergence receives less attention in the initial stages of training and the network can better explore the parameter space.
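A minimal sketch of such a loss for one granularity level follows. Only the coefficient $\lambda_t = \min(1.0, t/10)$ is taken from the text. The digamma form of the cross-entropy, the removal of correct-class evidence before the KL term, and the summation over the $K_m$ categories are the usual evidential deep learning choices and are assumptions here. All function names are hypothetical, and `digamma` is a small standard-library approximation rather than a library call.

```python
import math


def digamma(x):
    """Digamma for x > 0 via upward recurrence plus an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:               # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def decay_coefficient(t):
    """lambda_t = min(1.0, t / 10), with t the current training round."""
    return min(1.0, t / 10.0)


def evidential_bce(y, alpha_neg, alpha_pos):
    """Assumed digamma form of the binary cross-entropy for one category."""
    s = alpha_neg + alpha_pos
    return (y * (digamma(s) - digamma(alpha_pos))
            + (1 - y) * (digamma(s) - digamma(alpha_neg)))


def kl_to_uniform(y, alpha_neg, alpha_pos):
    """KL(Dir(adjusted alpha) || Dir(1, 1)) after dropping correct-class evidence."""
    a_neg = alpha_neg if y == 1 else 1.0   # keep only the misleading evidence
    a_pos = 1.0 if y == 1 else alpha_pos
    s = a_neg + a_pos
    log_norm = math.lgamma(s) - math.lgamma(a_neg) - math.lgamma(a_pos)
    return (log_norm
            + (a_neg - 1.0) * (digamma(a_neg) - digamma(s))
            + (a_pos - 1.0) * (digamma(a_pos) - digamma(s)))


def granularity_loss(labels, alphas, t):
    """Sum over the K_m categories of one granularity level (assumed reduction)."""
    lam = decay_coefficient(t)
    return sum(evidential_bce(y, a_neg, a_pos) + lam * kl_to_uniform(y, a_neg, a_pos)
               for y, (a_neg, a_pos) in zip(labels, alphas))
```

At round 0 the KL term is switched off entirely, and from round 10 onward it has full weight, which matches the schedule described above.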

[0141] (7) Input the test image into the trained network to obtain the understanding result of the intent behind the image.

[0142] This invention integrates evidence theory into the uncertainty framework, uses uncertainty to guide cross-granularity fusion, enhances the network's ability to represent cross-granularity information, greatly reduces the impact of intention category ambiguity, improves the comprehensive understanding of visual content, and thus improves the ability of human-computer visual interaction.

Claims

1. A visual intention understanding method based on an uncertainty cross-granularity evidence feature fusion network, characterized in that it includes the following steps:
Step 1: Construct an uncertainty cross-granularity evidence feature fusion network, including an image feature extraction network, an evidence generation network, and an evidence-guided uncertainty estimation network; the evidence generation network includes an evidence generation network for each granularity level, whose input is the shallow visual feature maps, whose output dimension is ..., and whose last layer is a non-negative activation layer; the granularity refers to the different levels at which the training image is analyzed and includes three levels: coarse, medium, and fine.
Step 2: Based on the image feature extraction network and the evidence generation network, obtain the binary membership evidence pairs corresponding to each intent category at different granularities in the training images.
Step 3: Align the cross-granularity hierarchical representation of the binary membership evidence pairs.
Step 4: Using the evidence-guided uncertainty estimation network, assign confidence and overall uncertainty to the binary membership evidence pairs corresponding to each intent category at different granularities, so as to generate subjective opinions for each intent category at different granularities. The assignment proceeds as follows: the evidence-guided uncertainty estimation network combines the Dirichlet distribution with the binary membership evidence pairs to allocate the confidence and uncertainty of the intent categories, where combining means establishing a connection between the parameters of the Dirichlet distribution and the binary membership evidence; based on the Dirichlet distribution, confidence is assigned to each part of the binary membership evidence pair of an intent category and the overall uncertainty is assigned; the subjective opinion for the intent category is then formed from the confidence and the overall uncertainty, expressed in terms of the Dirichlet strength and the belief masses.
Step 5: Fuse the subjective opinions from different granularities to obtain the intent understanding results of the training images at different granularities.
Step 6: Feed the intent understanding results obtained from the training images at different granularities into the binary evidence loss function, and use the binary evidence loss function to supervise the training of the cross-granularity evidence feature fusion network of Step 1 so as to optimize the network parameters.
Step 7: Input the test image into the trained cross-granularity evidence feature fusion network to obtain the visual intent understanding result of the image, thereby obtaining the true intent behind the image.

2. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 1, characterized in that the image feature extraction network described in step 1 includes a sample-level image feature extractor, wherein sample level means that each sample corresponds to only a single feature map, and the feature extractor is a pre-trained convolutional neural network used to extract shallow visual feature maps of the training images.

3. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 1, characterized in that, in step 2, the image feature extraction network and the evidence generation network are used to obtain the binary membership evidence pairs corresponding to each intent category at different granularities in the training image. Specifically, the image feature extraction network first extracts shallow visual feature maps of the training image; based on these feature maps, the evidence generation network generates the binary membership evidence pairs corresponding to each intent category at the different granularity levels; the binary membership evidence pairs represent the feature information of the training image related to the intent category at different granularities, and each intent category represents a different meaning or goal expressed by the training image.

4. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 2 or 3, characterized in that the binary membership evidence pair mentioned in step 2 comprises, for each intent category k, evidence supporting that the training image does not belong to the category and evidence supporting that it belongs to the category, where K represents the total number of intent categories and k = 1, 2, ..., K refers to a given intent category.

5. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 1, characterized in that the alignment of the cross-granularity hierarchical representation of the evidence pairs described in step 3 is needed because the number of intent categories differs across granularities, so the dimensions of the binary membership classification results also differ; the classification results of the binary membership evidence pairs at different granularities are therefore aligned into a unified form. The alignment method includes a cross-granularity evidence alignment strategy based on hierarchical relationships, wherein the hierarchical relationships include the constraint relationship between coarse-grained categories and fine-grained categories and the complementary relationship between fine-grained categories and coarse-grained categories. The cross-granularity evidence alignment includes aligning fine-grained evidence pairs to the medium-grained form and aligning medium-grained evidence pairs to the coarse-grained form. The strategy groups the subclasses that belong to the same coarse-grained or medium-grained class and integrates the binary membership evidence pairs within each group so that they align with the coarser-grained results. The integration is computed over the subclasses of each intent category at the given granularity layer, separately for the evidence of not belonging to the intent category and the evidence of belonging to it.

6. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 1, characterized in that the opinions of different granularities mentioned in step 5 include coarse-grained opinions, medium-grained opinions, medium-grained opinions integrated into the coarse-grained form, fine-grained opinions, and fine-grained opinions integrated into the medium-grained form. The fusion of subjective opinions from different granularities fuses opinions that have the same form but come from different granularity layers: the coarse-grained opinions are fused with the medium-grained opinions integrated into the coarse-grained form, and the medium-grained opinions are fused with the fine-grained opinions integrated into the medium-grained form. The fusion combines two independent subjective opinions into a joint opinion according to combination rules that involve the amount of conflict between the two subjective opinions and a regularization factor, and that yield the confidence and the uncertainty of the joint opinion. The fusion yields the intent understanding results of the training images at the three granularities (coarse, medium, and fine), that is, the fused subjective opinions corresponding to each intent category at the three granularities. The result at the coarse granularity is the joint opinion of the coarse-grained opinions and the medium-grained opinions integrated into the coarse-grained form; the result at the medium granularity is the joint opinion of the medium-grained opinions and the fine-grained opinions integrated into the medium-grained form; the result at the fine granularity consists of the fine-grained opinions.

7. The visual intent understanding method based on an uncertainty cross-granularity evidence feature fusion network as described in claim 1, characterized in that the binary evidence loss function mentioned in step 6 combines a binary cross-entropy loss, modified for binary evidence deep learning, with a regularization term. The binary cross-entropy loss uses the binary truth value of the training image with respect to each intent category, which is 1 if the training image belongs to the intent category and 0 if it does not. In the regularization term, $K_m$ is the number of intent categories at granularity layer m, $\lambda_t$ is the decay coefficient, and t is the current training round; during training, $\lambda_t$ gradually increases, so that the KL divergence receives less attention in the initial phase of training and the network can explore the parameter space more thoroughly.

8. A visual intent understanding system based on an uncertainty cross-granularity evidence feature fusion network, characterized in that it includes the following modules:
Network construction module: constructs an uncertainty cross-granularity evidence feature fusion network, including an image feature extraction network, an evidence generation network, and an evidence-guided uncertainty estimation network; the evidence generation network includes an evidence generation network for each granularity level, whose input is the shallow visual feature maps, whose output dimension is ..., and whose last layer is a non-negative activation layer; the granularity refers to the different levels at which the training image is analyzed and includes three levels: coarse, medium, and fine.
Evidence pair generation and alignment module: based on the image feature extraction network and the evidence generation network, obtains the binary membership evidence pairs corresponding to each intent category at different granularities in the training image, and aligns the cross-granularity hierarchical representation of the binary membership evidence pairs.
Opinion generation module: using the evidence-guided uncertainty estimation network, assigns confidence and overall uncertainty to the binary membership evidence pairs corresponding to each intent category at different granularities, thereby generating subjective opinions for each intent category at different granularities. The assignment proceeds as follows: the evidence-guided uncertainty estimation network combines the Dirichlet distribution with the binary membership evidence pairs to allocate the confidence and uncertainty of the intent categories, where combining means establishing a connection between the parameters of the Dirichlet distribution and the binary membership evidence; based on the Dirichlet distribution, confidence is assigned to each part of the binary membership evidence pair of an intent category and the overall uncertainty is assigned; the subjective opinion for the intent category is then formed from the confidence and the overall uncertainty, expressed in terms of the Dirichlet strength and the belief masses.
Opinion fusion module: fuses the subjective opinions from different granularities to obtain the intent understanding results of the training images at different granularities.
Network training module: feeds the intent understanding results obtained from the training images at different granularities into the binary evidence loss function, and uses the binary evidence loss function to supervise the training of the cross-granularity evidence feature fusion network so as to optimize the network parameters.
Image intent acquisition module: inputs the test image into the trained cross-granularity evidence feature fusion network to obtain the visual intent understanding result of the image, thereby obtaining the true intent behind the image.

9. A computer-readable storage medium for storing one or more programs, characterized in that the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.