Image classification method and device, computer device and storage medium thereof

By acquiring style-enhanced visual features and fuzzy text semantic features from images, and combining them with category and importance classifiers, the problem of classification accuracy when image data has high similarity is solved, thus achieving accurate image data classification.

CN116894974B · Active · Publication Date: 2026-06-23 · CHINA TELECOM CORP LTD TECHNOLOGY INNOVATION CENTER +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CORP LTD TECHNOLOGY INNOVATION CENTER
Filing Date: 2023-07-07
Publication Date: 2026-06-23

IPC: G06V10/764; G06V10/80; G06V10/82; G06V20/62; G06V30/19

AI Tagging

Application Domain

Character and pattern recognition

Technology Topics

Feature extraction; Classification methods

Technical Efficacy Phrases

Accurate classification processing; avoid interference

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately distinguish and classify image data with high similarity, leading to inaccurate classification processing.

Method used

By acquiring style-enhanced visual features and fuzzy text semantic features from images, and utilizing category classifiers and importance classifiers, the system performs classification processing by combining style-enhanced visual features and fuzzy text semantic features to achieve category and importance classification.

Benefits of technology

When image data has high similarity, it can be accurately classified, improving the accuracy of image data classification and preventing interference in the classification process.

Abstract

The application relates to an image classification method and device, computer equipment and a storage medium thereof, and relates to the technical field of artificial intelligence. The method comprises the following steps: obtaining a style-enhanced visual feature of an image to be classified based on a feature extraction network; obtaining a fuzzy text semantic feature of the image to be classified based on a text semantic extraction network; and performing classification processing on the image to be classified according to the style-enhanced visual feature and the fuzzy text semantic feature based on a target classifier, so as to obtain a target classification result of the image to be classified. The application can accurately perform classification processing on image data even when different image data has high similarity, can prevent the classification processing of the image data from being disturbed, and can improve the accuracy of the classification processing of the image data.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an image classification method, apparatus, computer device and storage medium thereof.

Background Technology

[0002] With the continuous development of the social economy, the scale and output of many enterprises have increased significantly, resulting in an increasing amount of related data (such as image data with text information). In order to better manage and store the business situation of enterprises, existing technologies can use optical character recognition and natural language processing to identify the text semantic features in image data, and then classify the image data according to the text semantic features.

[0003] When classifying and managing image data, if different image data have highly similar textual semantic features, existing technologies may be unable to accurately distinguish the textual semantic features of different image data, thus preventing them from accurately classifying and processing the image data.

Summary of the Invention

[0004] Therefore, it is necessary to provide an image classification method, apparatus, computer equipment, and storage medium that can accurately classify image data in order to address the above-mentioned technical problems.

[0005] Firstly, this application provides an image classification method. The method includes:

[0006] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0007] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0008] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0009] In one embodiment, the target classification result includes a category classification result and an importance classification result, and the target classifier includes a category classifier and an importance classifier;

[0010] Accordingly, based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified, including:

[0011] Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0012] Based on the importance classifier, the importance classification of the image to be classified is performed according to style-enhanced visual features and fuzzy text semantic features, thus obtaining the importance classification result of the image to be classified.

[0013] In one embodiment, based on an importance classifier, image importance classification is performed on style-enhanced visual features and fuzzy text semantic features to obtain the importance classification result of the image to be classified, including:

[0014] Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance-level features; and based on the importance-level features, the images to be classified are classified according to their importance to obtain the importance classification results.

[0015] In one embodiment, style-enhanced visual features of the image to be classified are obtained based on a feature extraction network, including:

[0016] Based on a feature extraction network, visual and stylistic features of the image to be classified are obtained.

[0017] Style enhancement processing is performed on visual features based on style features to obtain style-enhanced visual features of the image to be classified.

[0018] In one embodiment, the feature extraction network includes a visual extraction network and a style extraction network; the visual extraction network has the same number of visual extraction layers as the style extraction network, and they correspond one-to-one.

[0019] Accordingly, based on the feature extraction network, visual and stylistic features of the image to be classified are obtained, including:

[0020] The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified.

[0021] The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer. The sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified.

[0022] In one embodiment, the sub-style features output by each visual extraction layer are fused to obtain the style features of the image to be classified, including:

[0023] The sub-style features output from each visual extraction layer are fused to obtain an intermediate feature map of the image to be classified.

[0024] The intermediate feature map is transformed to obtain the style features of the image to be classified.

[0025] In one embodiment, based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained, including:

[0026] Based on the text region detection network in the text semantic extraction network, text region detection is performed on the image to be classified to obtain text region slices of the image to be classified.

[0027] Based on the text parsing network in the text semantic extraction network, the text feature map of the text region slice is determined;

[0028] Based on the Long Short-Term Memory network in the text semantic extraction network, text semantics are extracted from the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0029] Secondly, this application also provides an image classification apparatus. The apparatus includes:

[0030] The first acquisition module is used to acquire style-enhanced visual features of the image to be classified based on the feature extraction network.

[0031] The second acquisition module is used to acquire fuzzy text semantic features of the image to be classified based on the text semantic extraction network.

[0032] The classification module is used to classify images based on a target classifier, using style-enhanced visual features and fuzzy text semantic features, to obtain the target classification result of the images to be classified.

[0033] Thirdly, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:

[0034] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0035] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0036] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0037] Fourthly, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, performs the following steps:

[0038] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0039] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0040] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0041] Fifthly, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:

[0042] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0043] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0044] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0045] The aforementioned image classification method, apparatus, computer equipment, and storage medium acquire style-enhanced visual features and fuzzy text semantic features of the image to be classified, and then determine the target classification result of the image based on these features. Since the style-enhanced visual features in the above process refer to the visual characteristics after style enhancement processing, the determination of the target classification result based on these features and fuzzy text semantic features achieves a holistic assessment of image style, providing supplementary style information to the visual features. This ensures accurate classification even when different image data have high similarity, preventing interference with the image classification process and improving the accuracy of image data classification.

Attached Figure Description

[0046] Figure 1 is an application environment diagram of an image classification method provided in an embodiment of this application;

[0047] Figure 2 is a flowchart of an image classification method provided in an embodiment of this application;

[0048] Figure 3 is a flowchart of the steps for determining a target classification result provided in an embodiment of this application;

[0049] Figure 4 is a flowchart of the steps for determining style-enhanced visual features provided in an embodiment of this application;

[0050] Figure 5 is a flowchart of determining visual features and style features provided in an embodiment of this application;

[0051] Figure 6 is a flowchart of the steps for determining fuzzy text semantic features provided in an embodiment of this application;

[0052] Figure 7 is a schematic diagram of determining a text region slice provided in an embodiment of this application;

[0053] Figure 8 is a flowchart of determining fuzzy text semantic features provided in an embodiment of this application;

[0054] Figure 9 is a flowchart of the training process of a Long Short-Term Memory (LSTM) network provided in an embodiment of this application;

[0055] Figure 10 is a flowchart of another image classification method provided in an embodiment of this application;

[0056] Figure 11 is a flowchart of the process of determining a target classification result provided in an embodiment of this application;

[0057] Figure 12 is a structural block diagram of the first image classification device provided in an embodiment of this application;

[0058] Figure 13 is a structural block diagram of the second image classification device provided in an embodiment of this application;

[0059] Figure 14 is a structural block diagram of the third image classification device provided in an embodiment of this application;

[0060] Figure 15 is a structural block diagram of the fourth image classification device provided in an embodiment of this application;

[0061] Figure 16 is a structural block diagram of the fifth image classification device provided in an embodiment of this application;

[0062] Figure 17 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0063] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0064] In the description of this application, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples" indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0065] The image classification method provided in the embodiments of this application can be applied in the application environment shown in Figure 1. In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 1. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores the data acquired for the image classification method. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements an image classification method.

[0066] This application discloses an image classification method, apparatus, computer device and its storage medium. The computer device acquires style-enhanced visual features and fuzzy text semantic features of the image to be classified, and then performs classification processing on the image to be classified based on the style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0067] In one embodiment, Figure 2 is a flowchart of an image classification method provided in an embodiment of this application. As shown in Figure 2, the image classification method performed by the computer device of Figure 1 may include the following steps:

[0068] Step 201: Based on the feature extraction network, obtain the style-enhanced visual features of the image to be classified.

[0069] Style-enhanced visual features refer to visual features that have undergone style enhancement processing.

[0070] It should be noted that when it is necessary to determine the visual features of an image to be classified, the image to be classified can be input into the visual extraction network of the feature extraction network, and the output result of the visual extraction network can be obtained. This output result is the visual features of the image to be classified.

[0071] In one embodiment of this application, the visual extraction network includes multiple cascaded visual extraction layers (i.e., convolutional layers). Each visual extraction layer divides the input image to be classified into multiple overlapping sub-visual features (i.e., rectangular receptive field regions). Then, the pixel values in each sub-visual feature are combined by weighting to obtain the visual features of the image to be classified.
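The weighted combination over overlapping rectangular regions described above can be illustrated with a plain two-dimensional convolution. The following is only a minimal sketch under assumptions: the function name `conv2d`, the single-channel list-of-rows image, and the 2x2 kernel are hypothetical and are not taken from the application, which does not specify kernel sizes or weights.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of rows) and return the weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            acc = 0.0
            # Weighted combination of the pixel values inside one rectangular region.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[top + di][left + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 3x4 single-channel image and 2x2 kernel, for illustration only.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
]
kernel = [[1, 0], [0, 1]]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[6.0, 8.0, 3.0], [12.0, 14.0, 6.0]]
```

With a stride of 1, neighbouring regions share pixels, which corresponds to the overlapping regions mentioned in the paragraph.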

[0072] To further explain, before performing style enhancement processing on visual features, the style features of the image to be classified can be determined in advance through the style extraction network of the feature extraction network. Then, the visual features can be style-enhanced based on the style features to obtain the style-enhanced visual features of the image to be classified.

[0073] In one embodiment of this application, the style extraction network includes multiple parallel style extraction layers. The image to be classified can be input into the style extraction network, the style extraction layers obtain the sub-style features corresponding to the input image to be classified, and fuse the sub-style features to obtain the style features of the image to be classified.

[0074] To further explain, when performing style enhancement processing on visual features, the visual features of the image to be classified and the style features of the image to be classified can be spliced together to obtain the style-enhanced visual features of the image to be classified.

[0075] Step 202: Based on the text semantic extraction network, obtain the fuzzy text semantic features of the image to be classified.

[0076] Here, fuzzy text semantic features refer to features that contain all text information in the image to be classified.

[0077] It should be noted that when it is necessary to determine the fuzzy text semantic features of an image to be classified, text region detection can first be performed on the image to be classified to determine the text region slices containing text information; the text region slices can then be parsed to determine their text feature map; and the text semantics can be extracted from the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0078] To further explain, text region detection can be performed on the image to be classified through the text region detection network in the text semantic extraction network. Specifically, the image to be classified is input into the text region detection network, and the output result of the text region detection network is obtained. The output result is the text region slice containing text information in the image to be classified.
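As an illustration of what a text region slice is, the sketch below crops rectangular regions out of an image. It assumes, purely for illustration, that the detection step yields bounding boxes in the form (left, top, right, bottom); the application only states that the text region detection network outputs the slices, so the box format, the function name `crop_text_regions`, and the sample values are hypothetical.

```python
def crop_text_regions(image, boxes):
    """Return one slice (list of rows) per (left, top, right, bottom) box.

    `right` and `bottom` are exclusive, following Python slicing.
    """
    slices = []
    for left, top, right, bottom in boxes:
        slices.append([row[left:right] for row in image[top:bottom]])
    return slices


# Hypothetical 4x6 image whose pixel value encodes its own position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
# Hypothetical boxes standing in for the output of a text region detector.
boxes = [(1, 0, 4, 2), (0, 2, 6, 4)]
text_slices = crop_text_regions(image, boxes)
print(text_slices[0])  # [[1, 2, 3], [11, 12, 13]]
```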

[0079] To further explain, when parsing text region slices, the text parsing network in the text semantic extraction network can be used. Specifically, the text region slices are input into the text parsing network so that the text parsing network converts the text region slices into text feature maps of size (w, h, c), where w, h, and c represent the width, height, and number of channels of the text feature map, respectively.

[0080] To further explain, the fuzzy text semantic features of the image to be classified can be determined by using the Long Short-Term Memory (LSTM) network in the text semantic extraction network. Specifically, average pooling is performed on the text feature map of size (w, h, c) to obtain a text feature map of size (w, 1, c). The text feature map of size (w, 1, c) is regarded as a time series of length w, and this time series is input into the LSTM network to obtain the memory unit features output by the LSTM network at the final time step. These memory unit features are the fuzzy text semantic features of the image to be classified.
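The pooling and sequence steps above can be sketched as follows. This is a minimal illustration under assumptions, not the network of the application: the feature map is a nested list indexed as [w][h][c], the LSTM is a single layer with randomly initialised (untrained) weights, and the names `pool_height`, `lstm_final_cell`, and `random_weights` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pool_height(fmap):
    """Average a (w, h, c) nested list over h, giving w vectors of length c."""
    seq = []
    for column in fmap:  # column holds h vectors of length c
        n = len(column)
        seq.append([sum(px[k] for px in column) / n for k in range(len(column[0]))])
    return seq


def lstm_final_cell(seq, weights, hidden):
    """Run a single-layer LSTM over `seq` and return the final cell (memory) state."""
    h_t = [0.0] * hidden
    c_t = [0.0] * hidden
    for x_t in seq:
        z = x_t + h_t  # concatenated input [x_t, h_{t-1}]
        gates = {}
        for name in ("i", "f", "o", "g"):
            W, b = weights[name]
            gates[name] = [sum(wj * zj for wj, zj in zip(W[u], z)) + b[u] for u in range(hidden)]
        i = [sigmoid(v) for v in gates["i"]]    # input gate
        f = [sigmoid(v) for v in gates["f"]]    # forget gate
        o = [sigmoid(v) for v in gates["o"]]    # output gate
        g = [math.tanh(v) for v in gates["g"]]  # candidate memory
        c_t = [f[u] * c_t[u] + i[u] * g[u] for u in range(hidden)]
        h_t = [o[u] * math.tanh(c_t[u]) for u in range(hidden)]
    return c_t


def random_weights(in_dim, hidden, rng):
    """Untrained stand-in parameters: one (matrix, bias) pair per gate."""
    def layer():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim + hidden)] for _ in range(hidden)]
        return W, [0.0] * hidden
    return {name: layer() for name in ("i", "f", "o", "g")}


rng = random.Random(0)
w, h, c, hidden = 5, 3, 4, 6  # hypothetical sizes
fmap = [[[rng.random() for _ in range(c)] for _ in range(h)] for _ in range(w)]
seq = pool_height(fmap)  # (w, h, c) -> w steps of length c
text_feature = lstm_final_cell(seq, random_weights(c, hidden, rng), hidden)
print(len(seq), len(seq[0]), len(text_feature))  # 5 4 6
```

The returned vector is the cell state at the last of the w steps, which is one reading of the "memory unit features output at the final time step"; its length equals the hidden size rather than w or c.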

[0081] Step 203: Based on the target classifier, the image to be classified is classified according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0082] It should be noted that the target classification results include category classification results and importance classification results. Category classification results represent the category of content in the image to be classified; for example, category classification results may include, but are not limited to, tables, network topology diagrams, system architecture diagrams, logos, etc. Importance classification results characterize whether the image to be classified contains important information; for example, importance classification results may include, but are not limited to, important images, unimportant images, etc.

[0083] In one embodiment of this application, the category classification result corresponding to the image to be classified can be determined based on the category classifier in the target classifier; specifically, the style-enhanced visual features are input into the category classifier, and the output result of the category classifier is obtained, which is the category classification result of the image to be classified.

[0084] Here, the category classifier can determine the category classification result of the image to be classified based on the style-enhanced visual features of the image to be classified. Furthermore, the category classifier can be an MLP (Multilayer Perceptron) classifier.

[0085] In one embodiment of this application, the importance classification result corresponding to the image to be classified can be determined based on the importance classifier in the target classifier; specifically, style-enhanced visual features and fuzzy text semantic features are input into the importance classifier, and the output result of the importance classifier is obtained, which is the importance classification result of the image to be classified.

[0086] The importance classifier can determine the importance classification result of the image based on style-enhanced visual features and fuzzy text semantic features. Furthermore, the importance classifier can be an MLP (Multilayer Perceptron) classifier.
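The two classifiers can be pictured as two small MLP heads: one fed only the style-enhanced visual features, the other fed both feature vectors. The sketch below is an assumption-laden illustration: the layer sizes, the single hidden ReLU layer, the random untrained weights, and the names `mlp_classify` and `init_mlp` are all hypothetical, and the label lists simply reuse the examples given in paragraph [0082].

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_classify(x, params):
    """One hidden ReLU layer followed by a softmax output layer."""
    (W1, b1), (W2, b2) = params
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return softmax(linear(hidden, W2, b2))


def init_mlp(in_dim, hidden_dim, n_classes, rng):
    """Untrained stand-in parameters for a two-layer MLP."""
    def layer(n_in, n_out):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out
    return layer(in_dim, hidden_dim), layer(hidden_dim, n_classes)


CATEGORIES = ["table", "network topology diagram", "system architecture diagram", "logo"]
IMPORTANCE = ["important", "unimportant"]

rng = random.Random(0)
visual = [rng.random() for _ in range(8)]  # stand-in style-enhanced visual feature
text = [rng.random() for _ in range(4)]    # stand-in fuzzy text semantic feature

category_head = init_mlp(len(visual), 16, len(CATEGORIES), rng)
importance_head = init_mlp(len(visual) + len(text), 16, len(IMPORTANCE), rng)

# The category head sees only the visual feature; the importance head sees both.
category_probs = mlp_classify(visual, category_head)
importance_probs = mlp_classify(visual + text, importance_head)
category = CATEGORIES[category_probs.index(max(category_probs))]
importance = IMPORTANCE[importance_probs.index(max(importance_probs))]
```

Because the weights here are random, the predicted labels carry no meaning; the sketch only shows which inputs reach which head and that each head outputs a probability per class.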

[0087] The aforementioned image classification method acquires style-enhanced visual features and fuzzy text semantic features of the image to be classified, and then determines the target classification result based on these features. Since the style-enhanced visual features refer to the visual characteristics after style enhancement processing, determining the target classification result based on these features and fuzzy text semantic features allows for a holistic assessment of image style and provides supplementary style information to the visual features. This method enables accurate classification even when different image data have high similarity, preventing interference with the classification process and improving overall accuracy.

[0088] As enterprises grow in size and output, the amount of related data they generate, such as image data containing text information, also increases. However, when different image data are highly similar, conventional computer vision techniques may be unable to classify them accurately. To prevent this problem from hindering the classification of image data, the computer device in this embodiment can determine the target classification result as shown in Figure 3, where the target classification result includes a category classification result and an importance classification result, and the target classifier includes a category classifier and an importance classifier. Correspondingly, classifying the image to be classified according to the style-enhanced visual features and fuzzy text semantic features based on the target classifier, to obtain the target classification result, specifically includes the following steps:

[0089] Step 301: Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0090] In one embodiment of this application, when it is necessary to determine the category classification result of an image to be classified, style-enhanced visual features can be input into a category classifier, and the output result of the category classifier can be obtained. The output result is the category classification result of the image to be classified.

[0091] It should be noted that before determining the category classification result of the image to be classified based on the category classifier, the category classifier needs to be trained on classification training samples, to ensure that it can determine the category classification result of the image to be classified from the style-enhanced visual features.

[0092] The classification training samples can be: style-enhanced visual features of the sample image, and the category classification result corresponding to the visual features of the sample.

[0093] The training process for the category classifier can be either supervised or unsupervised; no specific method is specified for training the category classifier here.
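Since no training method is fixed here, the following is only one possible supervised sketch, with stated simplifications: it fits a linear softmax classifier (not an MLP) by full-batch gradient descent on cross-entropy, and the two-dimensional features, labels, learning rate, and the names `train` and `predict_probs` are hypothetical stand-ins for style-enhanced visual features and their category labels.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(x, W, b):
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)])


def train(features, labels, n_classes, lr=0.1, epochs=300):
    """Fit a linear softmax classifier by full-batch gradient descent on cross-entropy."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(features, labels):
            probs = predict_probs(x, W, b)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit k: predicted minus one-hot target.
                err = probs[k] - (1.0 if k == y else 0.0)
                grad_b[k] += err / n
                for j in range(dim):
                    grad_W[k][j] += err * x[j] / n
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j]
    return W, b


# Hypothetical, well-separated toy samples: (feature vector, category index).
features = [[2.0, 0.0], [2.2, 0.1], [0.0, 2.0], [0.1, 2.2], [-2.0, -2.0], [-2.1, -1.9]]
labels = [0, 0, 1, 1, 2, 2]
W, b = train(features, labels, n_classes=3)
predictions = [max(range(3), key=lambda k: predict_probs(x, W, b)[k]) for x in features]
```

On this toy data the fitted classifier reproduces the training labels; real training samples would be the style-enhanced visual features of sample images paired with their category classification results, as described in paragraph [0092].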

[0094] Step 302: Based on the importance classifier, the image to be classified is classified according to the style-enhanced visual features and fuzzy text semantic features to obtain the importance classification result of the image to be classified.

[0095] It should be noted that when it is necessary to determine the importance classification result of the image to be classified, the following can be included: based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain the importance classification features; and based on the importance classification features, the image to be classified is classified to obtain the importance classification result.

[0096] To further explain, concatenating the style-enhanced visual features and the fuzzy text semantic features joins the two feature vectors end to end, which expands the dimension. This can be understood as follows: if the dimension of the style-enhanced visual features is $d_x$ and the dimension of the fuzzy text semantic features is $d_y$, then the dimension of the importance classification features after concatenation is $d_z = d_x + d_y$.

[0097] For example, if both the style-enhanced visual feature and the fuzzy text semantic feature are 5-dimensional features, i.e., style-enhanced visual feature = [1, 1, 1, 1, 1] and fuzzy text semantic feature = [2, 2, 2, 2, 2], then the concatenated importance classification feature = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2].

[0098] For example, if the style-enhanced visual features are 512-dimensional and the fuzzy text semantic features are 256-dimensional, then the concatenated importance classification features have 512 + 256 = 768 dimensions.
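The two examples above can be checked directly with list concatenation. This is a trivial sketch; the function name `concat_features` is hypothetical, and the zero-filled vectors merely stand in for real 512- and 256-dimensional features.

```python
def concat_features(visual_feature, text_feature):
    """Join the two feature vectors end to end."""
    return list(visual_feature) + list(text_feature)


small = concat_features([1, 1, 1, 1, 1], [2, 2, 2, 2, 2])
large = concat_features([0.0] * 512, [0.0] * 256)

print(small)       # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(len(large))  # 768
```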

[0099] In one embodiment of this application, when it is necessary to determine the importance classification result of an image to be classified, the importance classification features can be input into the importance classifier, and the output result of the importance classifier can be obtained. The output result is the importance classification result of the image to be classified.

[0100] To further explain, before determining the importance classification result of the image to be classified based on the importance classifier, importance training samples can be obtained in advance to train the importance classifier, so as to ensure that the importance classifier can determine the importance classification result of the image to be classified based on style-enhanced visual features and fuzzy text semantic features.

[0101] The importance training samples can be: style-enhanced visual features of the sample image, fuzzy textual semantic features of the sample image, and the importance classification results corresponding to the visual features of the sample.

[0102] The image classification method described above, through a category classifier and an importance classifier, determines the category and importance classification results of the image to be classified based on style-enhanced visual features and fuzzy text semantic features. This enables the acquisition of the category to which the image belongs and the importance of the image to be classified, thus achieving accurate classification processing of image data.

[0103] In one embodiment, style enhancement processing can be performed on the visual features using the style features, so that the visual features of the image to be classified are enhanced according to its style and the target classification result of the image to be classified can then be determined. Specifically, the process of obtaining style-enhanced visual features of the image to be classified based on the feature extraction network is shown in Figure 4 and includes:

[0104] Step 401: Based on the feature extraction network, obtain the visual features and style features of the image to be classified.

[0105] The feature extraction network includes a visual extraction network and a style extraction network; the visual extraction network has the same number of visual extraction layers as the style extraction network, and they correspond one-to-one.

[0106] It should be noted that when it is necessary to obtain the visual features and style features of the image to be classified, the following can be included: inputting the image to be classified into a visual extraction network to obtain the sub-visual features output by each visual extraction layer; fusing the sub-visual features output by each visual extraction layer to obtain the visual features of the image to be classified; inputting the sub-visual features output by each visual extraction layer into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer; and fusing the sub-style features output by each visual extraction layer to obtain the style features of the image to be classified.

[0107] In one embodiment of this application, as shown in Figure 5, convolutional layer 1, the max pooling layer, residual structure 1, residual structure 2, residual structure 3, and residual structure 4 are the visual extraction layers of the visual extraction network; style extraction layers 1, 2, 3, 4, 5, and 6 in the style extraction network correspond one-to-one with those visual extraction layers. Specifically, the image to be classified is input into the visual extraction network, where each visual extraction layer outputs sub-visual features of the image to be classified. The pixel values in the sub-visual features are weighted and combined using an average pooling layer and a flatten + fully connected layer to obtain the visual features of the image to be classified. Furthermore, the sub-visual features output by each visual extraction layer are input into the corresponding style extraction layer, and the resulting sub-style features are fused to obtain an intermediate feature map of the image to be classified. The style features are obtained by processing the intermediate feature map with residual structures, an average pooling layer, and a flatten + fully connected layer.

[0108] As one implementation, the sub-visual features after convolution are feature maps of size (w, h, c), where w is the width of the feature map, h is the height of the feature map, and c is the number of channels in the feature map. The feature map of each channel (c channels in total) is flattened into a w*h-dimensional feature vector, denoted $F_i$, where $i \in \{1, 2, \ldots, c\}$. The sub-style feature (i.e., the Gram matrix obtained by the Gram matrix operation) is then:

[0109] $$G = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1c} \\ S_{21} & S_{22} & \cdots & S_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ S_{c1} & S_{c2} & \cdots & S_{cc} \end{bmatrix}, \qquad S_{ij} = F_i^{\top} F_j$$

[0110] where G is the sub-style feature, i.e., the Gram matrix; $S_{ij}$ is the inner product of the feature vectors $F_i$ and $F_j$; and c is the number of channels of the sub-visual features after convolution processing.
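
As a minimal sketch of the Gram matrix operation described above, the following NumPy function flattens each channel of a (w, h, c) feature map and takes pairwise inner products. The function name `gram_matrix` and the random input are illustrative assumptions; the application does not state whether any normalization is applied, so none is used here.

```python
import numpy as np

def gram_matrix(feature_map: np.ndarray) -> np.ndarray:
    """Return the (c, c) Gram matrix of a (w, h, c) feature map."""
    w, h, c = feature_map.shape
    # Row i of F is channel i flattened into a w*h-dimensional vector F_i.
    F = feature_map.reshape(w * h, c).T
    # Entry (i, j) is the inner product of F_i and F_j.
    return F @ F.T

rng = np.random.default_rng(0)
sub_visual = rng.standard_normal((4, 3, 8))   # hypothetical sub-visual feature, c = 8
G = gram_matrix(sub_visual)                   # shape (8, 8)
```

Because each entry is an inner product, the resulting matrix is symmetric and its size depends only on the channel count c, not on w or h.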

[0111] To further explain, when it is necessary to fuse the sub-style features output by each visual extraction layer to obtain the style features of the image to be classified, the following can be included: fusing the sub-style features output by each visual extraction layer to obtain an intermediate feature map of the image to be classified; and performing feature transformation on the intermediate feature map to obtain the style features of the image to be classified.

[0112] As an example, fusing the sub-style features means concatenating the Gram matrices along the channel dimension. Specifically, as shown in Figure 5, there are six sub-style features in total. Each sub-style feature G is expanded by one dimension, from (c, c) to (c, c, 1), and the expanded features are concatenated along the new dimension to obtain an intermediate feature map of size (c, c, 6), where c is the number of channels of the sub-visual feature after convolution.
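
The channel concatenation of the six Gram matrices can be sketched as follows. This sketch assumes that all six sub-style features share the same size (c, c), as the text implies; how layers with different channel counts would be brought to a common c is not specified in this passage, and the random matrices are placeholders.

```python
import numpy as np

c = 8
rng = np.random.default_rng(0)

# Six hypothetical (c, c) sub-style features, one per style extraction layer.
sub_style_features = [rng.standard_normal((c, c)) for _ in range(6)]

# Expand each Gram matrix from (c, c) to (c, c, 1) ...
expanded = [g[..., np.newaxis] for g in sub_style_features]

# ... and concatenate along the new last dimension to get (c, c, 6).
intermediate_feature_map = np.concatenate(expanded, axis=-1)
```

Each slice `intermediate_feature_map[:, :, k]` is then the k-th Gram matrix, so the intermediate feature map can be processed like a six-channel image by the subsequent layers.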

[0113] It should be noted that the number of sub-style features (i.e., the number of Gram matrices) varies depending on the type of visual extraction network and the type of style extraction network. For example, if the visual extraction network is ResNet-18 (a deep neural network), then the number of sub-style features is six.

[0114] Step 402: Perform style enhancement processing on the visual features based on style features to obtain style-enhanced visual features of the image to be classified.

[0115] It should be noted that style enhancement of the visual features based on the style features is performed by concatenating the style features and the visual features, so that the dimension of the result is the sum of the two feature dimensions.

[0116] For example, if both the style features and the visual features are 5-dimensional, i.e., style features = [3, 3, 3, 3, 3] and visual features = [4, 4, 4, 4, 4], then the style-enhanced visual features after concatenation are [3, 3, 3, 3, 3, 4, 4, 4, 4, 4].

[0117] The image classification method described above acquires the visual and style features of the image to be classified, and then performs style enhancement processing on the visual features based on the style features to obtain style-enhanced visual features. This ensures that even when different image data have a high degree of similarity, the image data can still be accurately classified, preventing interference in the image data classification process and improving the accuracy of image data classification.

[0118] In one embodiment, the fuzzy text semantic features of the image to be classified can be obtained through the text region detection network, the text parsing network, and the long short-term memory network in the text semantic extraction network. Specifically, the process of obtaining the fuzzy text semantic features of the image to be classified based on the text semantic extraction network is shown in Figure 6 and includes:

[0119] Step 601: Based on the text region detection network in the text semantic extraction network, perform text region detection on the image to be classified to obtain text region slices of the image to be classified.

[0120] A text region slice is a region of the image to be classified that contains text information.

[0121] It should be noted that the image to be classified may contain multiple text region slices. As shown in Figure 7, the image A to be classified is input into the text region detection network, and the output of the network is the set of text region slices of the image; in Figure 7, image A contains four text region slices.

[0122] A text region detection network is a network that can identify the regions of an input image that contain text information and segment those regions. Text region detection networks include, but are not limited to, the CRAFT text detection network and CRNN (a convolutional recurrent neural network model).

[0123] Step 602: Based on the text parsing network in the text semantic extraction network, determine the text feature map of the text region slice.

[0124] It should be noted that when it is necessary to determine the text feature map of a text region slice, the text region slice can be input into the text parsing network, so that the text parsing network can transform the text region slice into a text feature map of size (w, h, c), where w is the width of the text feature map, h is the height of the text feature map, and c is the number of channels of the text feature map.

[0125] To further explain, if there are multiple text region slices in the image to be classified, each text region slice is input into the text parsing network, so that the text parsing network converts each text region slice into a text feature map of size (w, h, c).

[0126] Step 603: Based on the Long Short-Term Memory network in the text semantic extraction network, perform text semantic extraction on the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0127] It should be noted that when determining the fuzzy text semantic features of an image to be classified, a text feature map of size (w, h, c) can be converted into a text feature map of size (w, c) through average pooling over the height dimension, where w is the width of the text feature map and c is the number of channels. This text feature map is then treated as a time series of length w, represented as $\{X_1, \ldots, X_t, \ldots, X_w\}$, where $X_t$ is the t-th c-dimensional feature vector in the feature map. The time series is then input into a Long Short-Term Memory (LSTM) network to obtain the memory unit features output by the LSTM network at the final time step. These memory unit features are the fuzzy text semantic features of the image to be classified.
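
The pooling and LSTM step can be illustrated with a small NumPy sketch. Several points are assumptions rather than details from the application: the average pooling is read as acting over the height axis so that the result is a length-w sequence of c-dimensional vectors, the "memory unit features" are read as the LSTM cell state, and the weights are random placeholders instead of trained parameters. The helper name `lstm_final_cell_state` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_cell_state(sequence, W, U, b):
    """Run a single-layer LSTM over `sequence` and return the final cell state.

    sequence: (w, c) array, W: (4*H, c), U: (4*H, H), b: (4*H,).
    """
    hidden_size = U.shape[1]
    h = np.zeros(hidden_size)      # hidden state
    cell = np.zeros(hidden_size)   # memory unit (cell state)
    for x_t in sequence:
        # Stacked pre-activations: input gate, forget gate, candidate, output gate.
        gates = W @ x_t + U @ h + b
        i, f, g, o = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        cell = f * cell + i * g
        h = o * np.tanh(cell)
    return cell

rng = np.random.default_rng(0)
width, height, channels, hidden_size = 10, 4, 16, 32

text_feature_map = rng.standard_normal((width, height, channels))   # (w, h, c)
sequence = text_feature_map.mean(axis=1)                             # (w, c)

# Random placeholder weights; a real network would use trained parameters.
W = rng.standard_normal((4 * hidden_size, channels)) * 0.1
U = rng.standard_normal((4 * hidden_size, hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

fuzzy_text_semantic = lstm_final_cell_state(sequence, W, U, b)       # (hidden_size,)
```

In this reading, the dimension of the fuzzy text semantic feature equals the LSTM hidden size and is independent of the width of the text region slice.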

[0128] To further explain, if the image to be classified has multiple text region slices, the fuzzy text semantic features of each text region slice are determined and their mean is calculated. The result is the fuzzy text semantic features of the image to be classified.

[0129] In one embodiment of this application, as shown in Figure 8, determining the fuzzy text semantic features of an image to be classified may proceed as follows. The image to be classified is input into the text region detection network to obtain at least one text region slice. Each text region slice is then input into the text parsing network to determine its text feature map. The text feature map of each text region slice is then input into the long short-term memory network to obtain the fuzzy text semantic features of that slice. Finally, the fuzzy text semantic features of all text region slices are averaged to obtain the fuzzy text semantic features of the image to be classified.
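
The final averaging over text region slices can be sketched as an element-wise mean. The four 2-dimensional vectors below are made-up values standing in for the per-slice LSTM outputs of an image with four text region slices.

```python
import numpy as np

# Hypothetical per-slice fuzzy text semantic features (one vector per text region slice).
per_slice_features = [
    np.array([1.0, 2.0]),
    np.array([3.0, 4.0]),
    np.array([5.0, 6.0]),
    np.array([7.0, 8.0]),
]

# Element-wise mean over slices gives the image-level fuzzy text semantic feature.
image_level_feature = np.mean(np.stack(per_slice_features), axis=0)
```

Averaging keeps the feature dimension fixed regardless of how many text region slices the detection network returns.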

[0130] It should be noted that the temporal output features $Y_t$ of the Long Short-Term Memory (LSTM) network can be obtained by inputting a sample dataset into the LSTM network. Then, based on the temporal output features $Y_t$, the LSTM network is trained on a classification task to obtain the trained LSTM network.

[0131] The sample dataset can be a text classification dataset. Furthermore, the text classification dataset can include, but is not limited to, the IMDB movie review sentiment analysis dataset and the SMS Spam Collection spam text message classification dataset.

[0132] In one embodiment of this application, as shown in Figure 9, training the long short-term memory network may proceed as follows: the text feature map of the sample dataset is determined by the text parsing network; the text feature map is processed by average pooling and then input into the long short-term memory network to obtain its temporal output features; and the temporal output features are used as the classification basis to train the long short-term memory network, yielding the trained long short-term memory network.

[0133] The image classification method described above obtains text region slices of the image to be classified, determines the text feature maps of those slices, and then determines the fuzzy text semantic features of the image from the text feature maps. Because the fuzzy text semantic features are determined directly from the text region slices, the method avoids the situation in which inaccurate optical character recognition (OCR) results make it impossible to accurately obtain the importance classification result of the image. This ensures that when it is necessary to determine the importance classification result of an image, image importance can be classified based on style-enhanced visual features and fuzzy text semantic features, thereby determining the final importance classification result.

[0134] In one embodiment, as shown in Figure 10, determining the target classification result of an image to be classified includes the following steps:

[0135] Step 1001: Input the image to be classified into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and fuse the sub-visual features output by each visual extraction layer to obtain the visual features of the image to be classified.

[0136] Step 1002: Input the sub-visual features output by each visual extraction layer into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer.

[0137] Step 1003: Fuse the sub-style features output from each visual extraction layer to obtain the intermediate feature map of the image to be classified.

[0138] Step 1004: Perform feature transformation on the intermediate feature map to obtain the style features of the image to be classified.

[0139] Step 1005: Perform style enhancement processing on the visual features based on style features to obtain style-enhanced visual features of the image to be classified.

[0140] Step 1006: Based on the text region detection network in the text semantic extraction network, perform text region detection on the image to be classified to obtain text region slices of the image to be classified.

[0141] Step 1007: Based on the text parsing network in the text semantic extraction network, determine the text feature map of the text region slice.

[0142] Step 1008: Based on the Long Short-Term Memory network in the text semantic extraction network, perform text semantic extraction on the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0143] Step 1009: Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0144] Step 1010: Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance-level features; and based on the importance-level features, the image to be classified is classified by image importance to obtain the importance classification result.

[0145] In one embodiment of this application, as shown in Figure 11, when it is necessary to determine the target classification result of an image to be classified, the image to be classified can be input into a visual extraction network. Then, based on the visual extraction network and the style extraction network, style-enhanced visual features are obtained. Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features, resulting in the category classification result. The image to be classified is input into a text region detection network to obtain text region slices. The text region slices are input into a text parsing network to obtain text feature maps of the text region slices. The text feature maps of the text region slices are input into a long short-term memory network to obtain fuzzy text semantic features of the image to be classified. The fuzzy text semantic features and style-enhanced visual features are concatenated to obtain importance ranking features. Based on the importance classifier, the image to be classified is classified according to the importance ranking features, resulting in the importance classification result.
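
The two classification branches described above can be wired together as in the following sketch. Everything beyond the data flow is an assumption for illustration: the application does not specify the classifier architecture here, so a linear layer followed by softmax stands in for both classifiers; the feature vectors and weights are random placeholders; and the feature sizes (512 and 256) and class counts (5 categories, 3 importance levels) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def linear_classifier(features, weights, bias):
    """Hypothetical stand-in classifier: linear layer followed by softmax."""
    probs = softmax(weights @ features + bias)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)

# Placeholders for the outputs of the feature extraction and text semantic networks.
style_enhanced_visual = rng.standard_normal(512)
fuzzy_text_semantic = rng.standard_normal(256)

num_categories, num_importance_levels = 5, 3     # assumed class counts
W_cat = rng.standard_normal((num_categories, 512)) * 0.01
W_imp = rng.standard_normal((num_importance_levels, 768)) * 0.01

# Category branch: uses the style-enhanced visual features only.
category, category_probs = linear_classifier(
    style_enhanced_visual, W_cat, np.zeros(num_categories))

# Importance branch: uses the concatenation of both feature vectors.
importance_features = np.concatenate([style_enhanced_visual, fuzzy_text_semantic])
importance, importance_probs = linear_classifier(
    importance_features, W_imp, np.zeros(num_importance_levels))
```

The sketch only shows which features feed which classifier; the predicted indices are meaningless with random weights.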

[0146] The aforementioned image classification method acquires style-enhanced visual features and fuzzy text semantic features of the image to be classified, and then determines the target classification result based on these features. Since the style-enhanced visual features refer to the visual characteristics after style enhancement processing, determining the target classification result based on these features and fuzzy text semantic features allows for a holistic assessment of image style and provides supplementary style information to the visual features. This method enables accurate classification even when different image data have high similarity, preventing interference with the classification process and improving overall accuracy.

[0147] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0148] Based on the same inventive concept, this application also provides an image classification apparatus for implementing the image classification method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more image classification apparatus embodiments provided below can be found in the limitations of the image classification method described above, and will not be repeated here.

[0149] In one embodiment, as shown in Figure 12, an image classification device is provided, including: a first acquisition module 10, a second acquisition module 20, and a classification module 30, wherein:

[0150] The first acquisition module 10 is used to acquire style-enhanced visual features of the image to be classified based on the feature extraction network.

[0151] The second acquisition module 20 is used to acquire fuzzy text semantic features of the image to be classified based on the text semantic extraction network.

[0152] The classification module 30 is used to classify the image to be classified based on the target classifier, according to style-enhanced visual features and fuzzy text semantic features, to obtain the target classification result of the image to be classified.

[0153] The aforementioned image classification device acquires style-enhanced visual features and fuzzy text semantic features of the image to be classified, and then determines the target classification result based on these features. Since the style-enhanced visual features refer to the visual characteristics after style enhancement processing, determining the target classification result based on these features and fuzzy text semantic features allows for a holistic assessment of image style and provides supplementary style information to the visual features. This enables accurate classification even when different image data have high similarity, preventing interference with the classification process and improving its accuracy.

[0154] In one embodiment, as shown in Figure 13, an image classification device is provided. The classification module 30 in this device includes: a first classification unit 31 and a second classification unit 32, wherein:

[0155] The first classification unit 31 is used to classify the image to be classified based on the category classifier and the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0156] The second classification unit 32 is used to classify the importance of the image to be classified based on the importance classifier, according to style-enhanced visual features and fuzzy text semantic features, and obtain the importance classification result of the image to be classified.

[0157] The second classification unit is specifically used for: performing feature concatenation processing on style-enhanced visual features and fuzzy text semantic features based on the importance classifier to obtain importance-level features; and performing image importance classification on the image to be classified based on the importance-level features to obtain the importance classification result.

[0158] In one embodiment, as shown in Figure 14, an image classification device is provided. The first acquisition module 10 of this image classification device includes: an acquisition unit 11 and a processing unit 12, wherein:

[0159] The acquisition unit 11 is used to acquire the visual features and style features of the image to be classified based on the feature extraction network.

[0160] The processing unit 12 is used to perform style enhancement processing on visual features based on style features to obtain style-enhanced visual features of the image to be classified.

[0161] In one embodiment, as shown in Figure 15, an image classification device is provided. The acquisition unit 11 in the image classification device includes: a first extraction subunit 111 and a second extraction subunit 112, wherein:

[0162] The first extraction subunit 111 is used to input the image to be classified into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and to fuse the sub-visual features output by each visual extraction layer to obtain the visual features of the image to be classified.

[0163] The second extraction subunit 112 is used to input the sub-visual features output by each visual extraction layer into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer, and to fuse the sub-style features output by each visual extraction layer to obtain the style features of the image to be classified.

[0164] The second extraction subunit is specifically used to: fuse the sub-style features output by each visual extraction layer to obtain an intermediate feature map of the image to be classified; and perform feature transformation on the intermediate feature map to obtain the style features of the image to be classified.

[0165] In one embodiment, as shown in Figure 16, an image classification device is provided. The second acquisition module 20 of this image classification device includes: a detection unit 21, a parsing unit 22, and an extraction unit 23, wherein:

[0166] The detection unit 21 is used to perform text region detection on the image to be classified based on the text region detection network in the text semantic extraction network, and obtain text region slices of the image to be classified.

[0167] The parsing unit 22 is used to determine the text feature map of the text region slice based on the text parsing network in the text semantic extraction network.

[0168] The extraction unit 23 is used to perform text semantic extraction on the text feature map based on the long short-term memory network in the text semantic extraction network, so as to obtain fuzzy text semantic features of the image to be classified.

[0169] Each module in the aforementioned image classification device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0170] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 17. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements an image classification method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; buttons, a trackball, or a touchpad on the casing of the computer device; or an external keyboard, touchpad, or mouse.

[0171] Those skilled in the art will understand that the structure shown in Figure 17 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0172] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0173] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0174] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0175] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0176] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0177] Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0178] Based on the importance classifier, the importance classification of the image to be classified is performed according to style-enhanced visual features and fuzzy text semantic features, thus obtaining the importance classification result of the image to be classified.

[0179] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0180] Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance-level features; and based on the importance-level features, the images to be classified are classified according to their importance to obtain the importance classification results.

[0181] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0182] Based on a feature extraction network, visual and stylistic features of the image to be classified are obtained.

[0183] Style enhancement processing is performed on visual features based on style features to obtain style-enhanced visual features of the image to be classified.

[0184] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0185] The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified.

[0186] The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer. The sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified.

[0187] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0188] The sub-style features output from each visual extraction layer are fused to obtain an intermediate feature map of the image to be classified.

[0189] The intermediate feature map is transformed to obtain the style features of the image to be classified.

[0190] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0191] Based on the text region detection network in the text semantic extraction network, text region detection is performed on the image to be classified to obtain text region slices of the image to be classified.

[0192] Based on the text parsing network in the text semantic extraction network, the text feature map of the text region slice is determined;

[0193] Based on the Long Short-Term Memory network in the text semantic extraction network, text semantics are extracted from the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0194] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0195] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0196] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0197] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0198] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0199] Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0200] Based on the importance classifier, the importance classification of the image to be classified is performed according to style-enhanced visual features and fuzzy text semantic features, thus obtaining the importance classification result of the image to be classified.

[0201] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0202] Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance-level features; and based on the importance-level features, the image to be classified is classified by importance to obtain the importance classification result.
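By way of non-limiting illustration, the two classifiers can be sketched in Python as follows. The sketch assumes each classifier is a single linear layer followed by softmax, which the text does not specify; the function names, weights, and feature sizes are hypothetical. What it is meant to show is only the difference in inputs: the category classifier sees the style-enhanced visual features alone, while the importance classifier sees the concatenation of the style-enhanced visual features and the fuzzy text semantic features.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def softmax(logits: Vector) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Matrix, bias: Vector, x: Vector) -> List[float]:
    """One output per weight row: dot(row, x) + bias."""
    if len(weights) != len(bias) or any(len(row) != len(x) for row in weights):
        raise ValueError("weight, bias and input shapes do not match")
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=lambda k: values[k])


def classify_category(visual: Vector, weights: Matrix, bias: Vector) -> int:
    """Category classification from the style-enhanced visual features only."""
    return argmax(softmax(linear(weights, bias, visual)))


def classify_importance(visual: Vector, text: Vector,
                        weights: Matrix, bias: Vector) -> int:
    """Importance classification from the concatenated features."""
    importance_features = list(visual) + list(text)  # concatenation step
    return argmax(softmax(linear(weights, bias, importance_features)))
```

Because the importance classifier consumes the concatenated vector, its weight rows must be as long as the two feature vectors combined; the shape check in `linear` raises an error otherwise.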

[0203] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0204] Based on a feature extraction network, visual features and style features of the image to be classified are obtained.

[0205] Style enhancement processing is performed on the visual features based on the style features to obtain style-enhanced visual features of the image to be classified.

[0206] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0207] The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified.

[0208] The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer. The sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified.
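By way of non-limiting illustration, the layer-by-layer pairing described above can be sketched in Python as follows. The sketch assumes that each visual extraction layer is a simple scaled ReLU, that each style extraction layer computes the mean and standard deviation of its input, and that fusion is plain concatenation; none of these choices is stated in the text, and all names are hypothetical. The point shown is the one-to-one correspondence: every visual extraction layer feeds exactly one style extraction layer.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relu_scale(k: float) -> Layer:
    """Hypothetical visual extraction layer: scale the input, then apply ReLU."""
    return lambda x: [max(0.0, k * v) for v in x]


def mean_std(x: Vector) -> Vector:
    """Hypothetical style extraction layer: first- and second-order statistics."""
    return [mean(x), pstdev(x)]


def extract_visual_and_style(
    image: Vector,
    visual_layers: Sequence[Layer],
    style_layers: Sequence[Layer],
) -> Tuple[Vector, Vector]:
    """Run the paired layers and fuse their outputs by concatenation (assumed)."""
    if len(visual_layers) != len(style_layers):
        raise ValueError("visual and style layers must correspond one-to-one")
    sub_visual: List[Vector] = []
    sub_style: List[Vector] = []
    x = image
    for visual_layer, style_layer in zip(visual_layers, style_layers):
        x = visual_layer(x)               # sub-visual feature of this layer
        sub_visual.append(x)
        sub_style.append(style_layer(x))  # paired style layer consumes it
    visual = [v for feature in sub_visual for v in feature]
    style = [v for feature in sub_style for v in feature]
    return visual, style
```

For example, `extract_visual_and_style(pixels, [relu_scale(1.0), relu_scale(2.0)], [mean_std, mean_std])` returns one fused visual vector and one fused style vector, and passing lists of different lengths raises a `ValueError`.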

[0209] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0210] The sub-style features output from each visual extraction layer are fused to obtain an intermediate feature map of the image to be classified.

[0211] The intermediate feature map is transformed to obtain the style features of the image to be classified.
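By way of non-limiting illustration, the two steps above can be sketched in Python as follows. The text does not say how the sub-style features are fused or which transformation is applied to the intermediate feature map. The sketch therefore assumes that equal-length sub-style features are stacked row by row, and it swaps in a Gram matrix (a commonly used style descriptor) as the transformation; both choices and all function names are assumptions.

```python
from typing import List, Sequence


def fuse_sub_style(sub_style: Sequence[Sequence[float]]) -> List[List[float]]:
    """Stack equal-length sub-style vectors into a (layers x channels) map."""
    if not sub_style:
        raise ValueError("at least one sub-style feature is required")
    width = len(sub_style[0])
    if any(len(feature) != width for feature in sub_style):
        raise ValueError("sub-style features must have the same length")
    return [list(feature) for feature in sub_style]


def gram_transform(feature_map: Sequence[Sequence[float]]) -> List[float]:
    """Flattened Gram matrix: G[i][j] = dot(row_i, row_j) / channels."""
    width = len(feature_map[0])
    return [
        sum(a * b for a, b in zip(row_i, row_j)) / width
        for row_i in feature_map
        for row_j in feature_map
    ]


def style_features(sub_style: Sequence[Sequence[float]]) -> List[float]:
    """Fuse, then transform, mirroring the two steps in the text."""
    intermediate_feature_map = fuse_sub_style(sub_style)
    return gram_transform(intermediate_feature_map)
```

With this choice, the length of the resulting style feature depends only on the number of layers (it is the square of that number), not on the size of each sub-style feature.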

[0212] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0213] Based on the text region detection network in the text semantic extraction network, text region detection is performed on the image to be classified to obtain text region slices of the image to be classified.

[0214] Based on the text parsing network in the text semantic extraction network, the text feature map of the text region slice is determined.

[0215] Based on the Long Short-Term Memory network in the text semantic extraction network, text semantics are extracted from the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0216] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps:

[0217] Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained;

[0218] Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;

[0219] Based on the target classifier, the image to be classified is processed according to style-enhanced visual features and fuzzy text semantic features to obtain the target classification result of the image to be classified.

[0220] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0221] Based on the category classifier, the image to be classified is classified according to the style-enhanced visual features to obtain the category classification result of the image to be classified.

[0222] Based on the importance classifier, the importance classification of the image to be classified is performed according to style-enhanced visual features and fuzzy text semantic features, thus obtaining the importance classification result of the image to be classified.

[0223] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0224] Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance-level features; and based on the importance-level features, the image to be classified is classified by importance to obtain the importance classification result.

[0225] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0226] Based on a feature extraction network, visual features and style features of the image to be classified are obtained.

[0227] Style enhancement processing is performed on the visual features based on the style features to obtain style-enhanced visual features of the image to be classified.

[0228] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0229] The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified.

[0230] The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer. The sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified.

[0231] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0232] The sub-style features output from each visual extraction layer are fused to obtain an intermediate feature map of the image to be classified.

[0233] The intermediate feature map is transformed to obtain the style features of the image to be classified.

[0234] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0235] Based on the text region detection network in the text semantic extraction network, text region detection is performed on the image to be classified to obtain text region slices of the image to be classified.

[0236] Based on the text parsing network in the text semantic extraction network, the text feature map of the text region slice is determined.

[0237] Based on the Long Short-Term Memory network in the text semantic extraction network, text semantics are extracted from the text feature map to obtain the fuzzy text semantic features of the image to be classified.

[0238] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for classification, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0239] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can carry out the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc.

[0240] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0241] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An image classification method, characterized in that the method includes:
Based on a feature extraction network, style-enhanced visual features of the image to be classified are obtained; the feature extraction network includes a visual extraction network and a style extraction network; the number of visual extraction layers in the visual extraction network is the same as the number of style extraction layers in the style extraction network, and they correspond one-to-one;
Based on a text semantic extraction network, fuzzy text semantic features of the image to be classified are obtained;
Based on the target classifier, the image to be classified is classified according to the style-enhanced visual features and the fuzzy text semantic features to obtain the target classification result of the image to be classified;
The step of obtaining style-enhanced visual features of the image to be classified based on the feature extraction network includes:
The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified;
The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer; the sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified;
The visual features are style-enhanced based on the style features to obtain the style-enhanced visual features of the image to be classified.

2. The method according to claim 1, characterized in that the target classification result includes a category classification result and an importance classification result, and the target classifier includes a category classifier and an importance classifier;
Accordingly, classifying the image to be classified based on the target classifier, according to the style-enhanced visual features and the fuzzy text semantic features, to obtain the target classification result of the image to be classified includes:
Based on the category classifier, the image to be classified is classified by category according to the style-enhanced visual features to obtain the category classification result of the image to be classified;
Based on the importance classifier, the image to be classified is classified by importance according to the style-enhanced visual features and the fuzzy text semantic features to obtain the importance classification result of the image to be classified.

3. The method according to claim 2, characterized in that the step of classifying the image to be classified by importance, based on the importance classifier and according to the style-enhanced visual features and the fuzzy text semantic features, to obtain the importance classification result of the image to be classified includes:
Based on the importance classifier, the style-enhanced visual features and the fuzzy text semantic features are concatenated to obtain importance classification features; and based on the importance classification features, the image to be classified is classified by importance to obtain the importance classification result.

4. The method according to claim 1, characterized in that the process of fusing the sub-style features output from each visual extraction layer to obtain the style features of the image to be classified includes:
The sub-style features output from each visual extraction layer are fused to obtain the intermediate feature map of the image to be classified;
The intermediate feature map is transformed to obtain the style features of the image to be classified.

5. The method according to any one of claims 1-4, characterized in that obtaining fuzzy text semantic features of the image to be classified based on the text semantic extraction network includes:
Based on the text region detection network in the text semantic extraction network, text region detection is performed on the image to be classified to obtain text region slices of the image to be classified;
Based on the text parsing network in the text semantic extraction network, the text feature map of the text region slice is determined;
Based on the Long Short-Term Memory network in the text semantic extraction network, text semantic extraction is performed on the text feature map to obtain the fuzzy text semantic features of the image to be classified.

6. The method according to any one of claims 1-4, characterized in that the process of performing style enhancement processing on the visual features based on the style features to obtain style-enhanced visual features of the image to be classified includes:
The visual features and the style features are combined to obtain the style-enhanced visual features of the image to be classified.

7. An image classification device, characterized in that the device includes:
The first acquisition module is used to acquire style-enhanced visual features of the image to be classified based on a feature extraction network; the feature extraction network includes a visual extraction network and a style extraction network; the number of visual extraction layers in the visual extraction network is the same as the number of style extraction layers in the style extraction network, and they correspond one-to-one;
The second acquisition module is used to acquire fuzzy text semantic features of the image to be classified based on the text semantic extraction network;
The classification module is used to classify the image to be classified based on the target classifier, according to the style-enhanced visual features and the fuzzy text semantic features, to obtain the target classification result of the image to be classified;
The step of obtaining style-enhanced visual features of the image to be classified based on the feature extraction network includes:
The image to be classified is input into the visual extraction network to obtain the sub-visual features output by each visual extraction layer; and the sub-visual features output by each visual extraction layer are fused to obtain the visual features of the image to be classified;
The sub-visual features output by each visual extraction layer are input into the style extraction layer corresponding to each visual extraction layer to obtain the sub-style features output by each visual extraction layer; the sub-style features output by each visual extraction layer are then fused to obtain the style features of the image to be classified;
The visual features are style-enhanced based on the style features to obtain the style-enhanced visual features of the image to be classified.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

Citation Information

Patent Citations

Image classification method and system based on cross-modal semantic representation learning and fusion
CN114898156A
