Method, apparatus, device, storage medium and computer program product for information recognition
By extracting real-time descriptive content from live video and performing feature extraction and anomaly identification, the problem of efficiency and accuracy in identifying false advertising in live e-commerce has been solved, achieving automated false advertising detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2024-12-17
- Publication Date
- 2026-06-19
AI Technical Summary
In live-stream e-commerce, consumers cannot determine in real time whether product information is falsely advertised. Existing technologies have low keyword recall efficiency and subjective errors in human identification, resulting in insufficient information identification efficiency and accuracy.
By extracting real-time descriptive content from videos, performing feature extraction and anomaly description identification, and utilizing predefined descriptive features to identify target descriptive features, automated identification of product information in live streams can be achieved.
It improves the efficiency and accuracy of information identification, can automatically identify false advertising, reduce human intervention, and enhance the standardization of live-streaming e-commerce.
Figure CN122243513A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of e-commerce technology, and in particular to a method, apparatus, computer equipment, storage medium, and computer program product for information identification.
Background Technology
[0002] With the development of e-commerce technology and the rise of e-commerce platforms, online consumption has become increasingly important in people's lives. Live-streaming e-commerce has become the mainstream of online consumption development, and the trend of dynamic product information display is becoming increasingly apparent. However, because consumers cannot physically see the products during live streams and cannot determine whether the live stream involves false advertising, the problem of false advertising regarding product information frequently occurs in live-streaming e-commerce.
[0003] Currently, keyword retrieval is typically applied to the product information, followed by manual comparison to verify whether the promotional information in the live stream matches the actual product information. This manual review determines whether the product information constitutes false advertising. However, product information is diverse, and the descriptions of products during live streams are often colloquial, leading to potential omissions in keyword retrieval. Furthermore, manual identification suffers from subjective judgment errors and processing inefficiencies. Therefore, improving the efficiency and accuracy of information identification is a pressing issue that needs to be addressed.
Summary of the Invention
[0004] In view of the above technical problems, it is necessary to provide a method, apparatus, computer equipment, storage medium, and computer program product for information recognition that improve the efficiency and accuracy of information recognition.
[0005] Firstly, this application provides a method for information identification. The method includes:
[0006] Extract real-time descriptions of the current object from videos that describe and display multiple objects separately;
[0007] Feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object;
[0008] Based on real-time description features, target description features predefined for the current display object are identified from the predefined description features for each display object in the video, and the display objects of the video include the current display object;
[0009] Based on predefined description content representing target description features, abnormal description identification is performed on real-time description content to obtain abnormal description identification results for the currently displayed object.
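The four steps above can be sketched as a single function. This is a minimal illustration, not the application's implementation: the names `RegisteredObject` and `recognize`, the toy keyword-count features, the dot-product similarity, and the material-based abnormality check are all hypothetical stand-ins chosen only to make the control flow runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class RegisteredObject:
    """Hypothetical record of one display object registered before the video."""
    display_number: int
    predefined_content: str
    predefined_features: Vector


def recognize(realtime_content: str,
              registry: Sequence[RegisteredObject],
              extract_features: Callable[[str], Vector],
              similarity: Callable[[Vector, Vector], float],
              is_abnormal: Callable[[str, str], bool]) -> Dict[str, object]:
    # Step 2: turn the real-time description content into features.
    realtime_features = extract_features(realtime_content)
    # Step 3: pick the registered object whose predefined features are closest.
    target = max(registry,
                 key=lambda o: similarity(realtime_features, o.predefined_features))
    # Step 4: compare the real-time content with the predefined content.
    abnormal = is_abnormal(target.predefined_content, realtime_content)
    return {"display_number": target.display_number, "abnormal": abnormal}


# Toy stand-ins (assumptions for illustration only).
VOCAB = ["cotton", "silk", "shirt", "scarf", "sleeve"]
MATERIALS = {"cotton", "silk"}


def toy_features(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def toy_is_abnormal(predefined: str, realtime: str) -> bool:
    # Abnormal when the live description claims a material that is not registered.
    claimed = MATERIALS & set(realtime.lower().split())
    registered = MATERIALS & set(predefined.lower().split())
    return bool(claimed - registered)


registry = [
    RegisteredObject(1, "cotton long sleeve shirt", toy_features("cotton long sleeve shirt")),
    RegisteredObject(2, "silk scarf", toy_features("silk scarf")),
]
result = recognize("this long sleeve shirt is pure silk",
                   registry, toy_features, dot, toy_is_abnormal)
```

Step 1 (extracting the real-time description content from the video) is assumed to have already produced the string passed to `recognize`.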
[0010] Secondly, this application also provides an information identification device. The device includes:
[0011] The content extraction module is used to extract real-time descriptive content for the current display object from videos that describe and display multiple display objects respectively;
[0012] The feature extraction module is used to extract features from the real-time description content to obtain the real-time description features of the currently displayed object;
[0013] The feature recognition module is used to identify, based on real-time description features, a target description feature predefined for the current display object from predefined description features for each display object in the video, wherein the display objects in the video include the current display object;
[0014] The information recognition module is used to identify abnormal descriptions in real-time descriptions based on predefined descriptions that characterize the features of the target description, and to obtain the abnormal description recognition results for the currently displayed object.
[0015] In one embodiment, the real-time description content includes at least real-time description text and real-time displayed images;
[0016] The feature extraction module is specifically used to extract text features from real-time descriptive text to obtain the text features of the currently displayed object; to extract image features from real-time displayed images to obtain the image features of the currently displayed object; and to perform feature concatenation processing on the text features and image features to obtain the real-time descriptive features of the currently displayed object.
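As a minimal sketch of the feature concatenation described above, the text features and image features can be joined end to end into one vector. The per-modality L2 normalization is an added assumption (the application does not state it), and `l2_normalize` and `concatenate_features` are hypothetical names.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(features: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0.0:
        return list(features)
    return [x / norm for x in features]


def concatenate_features(text_features: Sequence[float],
                         image_features: Sequence[float],
                         normalize: bool = True) -> Vector:
    """Join text and image features into one real-time description feature."""
    if normalize:
        # Assumption: normalizing each modality keeps either one from dominating.
        return l2_normalize(text_features) + l2_normalize(image_features)
    return list(text_features) + list(image_features)


realtime_feature = concatenate_features([3.0, 4.0], [0.0, 2.0, 0.0])
```

The resulting vector has as many components as the two inputs combined, so predefined features built the same way remain directly comparable.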
[0017] In one embodiment, the object description method for the currently displayed object in the video includes language description and subtitle description; the real-time description content includes at least real-time description text;
[0018] The feature extraction module is specifically used to extract audio from videos that display multiple objects through language descriptions, extract the audio describing the current object in real time, and convert the audio into text to obtain the first real-time description text for the current object; perform text recognition on the subtitles displayed in the video for the current object to obtain the second real-time description text for the current object; and determine the first real-time description text and the second real-time description text as the real-time description text.
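The two text sources in this embodiment can be sketched as follows. The speech-to-text and subtitle recognition engines are passed in as callables because the application does not name any; `speech_to_text`, `read_subtitles`, and `build_realtime_text` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Iterable, TypeVar

Audio = TypeVar("Audio")
Frame = TypeVar("Frame")


def build_realtime_text(audio_segments: Iterable[Audio],
                        subtitle_frames: Iterable[Frame],
                        speech_to_text: Callable[[Audio], str],
                        read_subtitles: Callable[[Frame], str]) -> Dict[str, str]:
    """Collect the first (spoken) and second (subtitle) real-time description text."""
    # First real-time description text: audio converted to text.
    first = " ".join(speech_to_text(segment) for segment in audio_segments)
    # Second real-time description text: text recognized from displayed subtitles.
    second = " ".join(read_subtitles(frame) for frame in subtitle_frames)
    # Both texts together form the real-time description text.
    return {"first_text": first, "second_text": second}
```

Keeping the two texts separate in the result, rather than merging them into one string, is a design choice of this sketch so that later steps can still tell spoken claims from on-screen ones.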
[0019] In one embodiment, the real-time description content includes at least a real-time display image of the currently displayed object;
[0020] The feature extraction module is specifically used to take screenshots of videos that display multiple objects through language descriptions, and determine the obtained video screenshots as real-time display images; or, to perform image frame extraction processing on videos that display multiple objects through language descriptions, and determine the obtained image video frames as real-time display images.
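One way to picture the frame extraction alternative is sampling one frame per fixed time interval. The fixed interval is an assumption of this sketch; the application only says that frames are extracted. `sample_frame_indices` is a hypothetical helper that computes which frame indices to keep, leaving the actual decoding to whatever video library is in use.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float,
                         interval_seconds: float) -> List[int]:
    """Return the indices of frames taken once every `interval_seconds`."""
    if total_frames <= 0 or fps <= 0 or interval_seconds <= 0:
        return []
    # Number of frames between two sampled frames, at least one.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))


indices = sample_frame_indices(total_frames=100, fps=25.0, interval_seconds=1.0)
```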
[0021] In one embodiment, the information recognition device further includes a prompt information acquisition module;
[0022] The prompt information acquisition module is used to acquire target prompt information that represents the display number of the displayed object in the video;
[0023] The feature recognition module is specifically used to determine the location of the target object that matches the current display object based on real-time description features, predefined description features of each displayed object in the video, and target prompt information; and to determine the predefined description features of the display object corresponding to the target object location as the predefined target description features for the current display object.
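A minimal sketch of this matching step, assuming that the target prompt information can be represented as the display numbers used to key the predefined features, and that cosine similarity is the matching criterion. Neither assumption is stated in the application; `cosine_similarity` and `locate_target` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def locate_target(realtime_features: Sequence[float],
                  predefined_by_number: Dict[int, Vector]) -> int:
    """Return the display number whose predefined features best match."""
    return max(predefined_by_number,
               key=lambda n: cosine_similarity(realtime_features,
                                               predefined_by_number[n]))


predefined = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [0.0, 0.0, 1.0]}
target_number = locate_target([0.1, 0.9, 0.2], predefined)
target_features = predefined[target_number]
```

The predefined features stored under the returned display number then serve as the target description features for the current display object.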
[0024] In one embodiment, the information recognition module is specifically used to determine predefined description content used to characterize the target description features; extract predefined attribute information from the predefined description content and extract real-time attribute information from the real-time description content; and perform abnormal description recognition on the real-time attribute information based on the predefined attribute information and the real-time attribute information to obtain the abnormal description recognition result for the currently displayed object.
[0025] In one embodiment, the information recognition module is specifically used to determine that the abnormal description recognition result of the currently displayed object is no abnormal description if the predefined attribute information is consistent with the real-time attribute information; and to determine that the abnormal description recognition result of the currently displayed object is that an abnormal description exists if the predefined attribute information is inconsistent with the real-time attribute information.
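The consistency check in the two embodiments above can be sketched as a comparison of attribute dictionaries. The dictionary representation and the case-insensitive, whitespace-trimmed string comparison are assumptions of this sketch; `compare_attributes` is a hypothetical name.

```python
from typing import Dict, Tuple


def compare_attributes(predefined: Dict[str, str],
                       realtime: Dict[str, str]) -> Dict[str, object]:
    """Flag an abnormal description when any shared attribute differs."""
    def norm(value: str) -> str:
        # Assumption: values are compared ignoring case and surrounding spaces.
        return value.strip().casefold()

    mismatched: Dict[str, Tuple[str, str]] = {
        key: (predefined[key], realtime[key])
        for key in predefined
        if key in realtime and norm(predefined[key]) != norm(realtime[key])
    }
    return {"abnormal": bool(mismatched), "mismatched": mismatched}


registered = {"brand": "Acme", "material": "Cotton"}
consistent = compare_attributes(registered, {"brand": "acme ", "material": "cotton"})
inconsistent = compare_attributes(registered, {"brand": "Acme", "material": "silk"})
```

Returning the mismatched attributes alongside the flag lets a later review step see which claim differed, not just that one did.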
[0026] In one embodiment, the information identification device further includes an attribute type acquisition module;
[0027] The attribute type acquisition module is used to acquire multiple predefined attribute types that describe objects from multiple descriptive dimensions.
[0028] The information recognition module is specifically used to extract attributes from predefined description content according to multiple predefined attribute types to obtain predefined attribute information; the predefined attribute information includes attribute information of multiple predefined attribute types; and to extract attributes from real-time description content according to multiple predefined attribute types to obtain real-time attribute information; the real-time attribute information includes attribute information of multiple predefined attribute types.
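To illustrate extracting the same predefined attribute types from both the predefined and the real-time description content, here is a deliberately simple sketch. The regular-expression patterns, the attribute types `brand`, `origin`, and `material`, and the function name `extract_attributes` are all assumptions; the application does not say how attributes are extracted, and a production system would likely use a learned model.

```python
import re
from typing import Dict, Iterable, Optional

# Hypothetical patterns, one per predefined attribute type.
ATTRIBUTE_PATTERNS = {
    "brand": re.compile(r"brand[:\s]+(\w+)", re.IGNORECASE),
    "origin": re.compile(r"(?:made in|origin[:\s]+)\s*(\w+)", re.IGNORECASE),
    "material": re.compile(r"material[:\s]+(\w+)", re.IGNORECASE),
}


def extract_attributes(content: str,
                       attribute_types: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return one value per attribute type, or None when the content is silent."""
    result: Dict[str, Optional[str]] = {}
    for attribute_type in attribute_types:
        match = ATTRIBUTE_PATTERNS[attribute_type].search(content)
        result[attribute_type] = match.group(1) if match else None
    return result


types = ["brand", "origin", "material"]
predefined_info = extract_attributes("Brand: Acme, material: cotton, made in Italy", types)
realtime_info = extract_attributes("this one is made in France brand Acme", types)
```

Because both calls use the same list of attribute types, the two results share the same keys and can be compared type by type.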
[0029] In one embodiment, the anomaly description identification result also includes unknown attribute information;
[0030] The information recognition module is specifically used to identify the attribute information of the predefined attribute type that has an empty value in the predefined attribute information as the first unknown attribute information; to identify the attribute information of the predefined attribute type that has an empty value in the real-time attribute information as the second unknown attribute information; and to identify the attribute information of the matching predefined attribute type as unknown attribute information when the first unknown attribute information or the second unknown attribute information exists.
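The unknown-attribute rule above can be sketched directly: an attribute type is unknown when its value is empty on either side. Representing empty values as `None` or an empty string, and the name `find_unknown_attributes`, are assumptions of this sketch.

```python
from typing import Dict, Iterable, List, Optional


def find_unknown_attributes(predefined: Dict[str, Optional[str]],
                            realtime: Dict[str, Optional[str]],
                            attribute_types: Iterable[str]) -> List[str]:
    """List attribute types whose value is empty in either description."""
    types = list(attribute_types)
    # First unknown attribute information: empty in the predefined attributes.
    first_unknown = {t for t in types if not predefined.get(t)}
    # Second unknown attribute information: empty in the real-time attributes.
    second_unknown = {t for t in types if not realtime.get(t)}
    # A type is unknown when either side is empty.
    return sorted(first_unknown | second_unknown)


unknown = find_unknown_attributes(
    {"brand": "Acme", "origin": None, "material": "cotton"},
    {"brand": "Acme", "origin": "France", "material": ""},
    ["brand", "origin", "material"],
)
```

Treating these types as unknown rather than as mismatches avoids reporting an abnormal description when one side simply did not mention the attribute.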
[0031] In one embodiment, the information recognition device further includes a descriptive feature acquisition module;
[0032] The descriptive feature acquisition module is used to acquire a predefined image and predefined text of the display object; to extract image features from the predefined image to obtain the predefined image features of the display object, and to extract text features from the predefined text to obtain the predefined text features of the display object; and to perform feature concatenation processing on the predefined image features and the predefined text features to obtain the predefined descriptive features of the display object.
[0033] In one embodiment, the descriptive feature acquisition module is specifically used to extract image features from multiple predefined sub-images included in the predefined image to obtain the predefined sub-image features of each predefined sub-image; and to perform feature fusion processing on the predefined sub-image features to obtain the predefined image features of the displayed object.
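A minimal sketch of building the predefined descriptive features from several sub-images and one text. Element-wise averaging is used as the fusion operation purely as an assumption, since the application does not specify the fusion method; `fuse_sub_image_features` and `build_predefined_features` are hypothetical names.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_sub_image_features(sub_image_features: Sequence[Sequence[float]]) -> Vector:
    """Fuse per-image features into one vector by element-wise mean (assumed)."""
    if not sub_image_features:
        raise ValueError("at least one sub-image feature vector is required")
    dim = len(sub_image_features[0])
    if any(len(f) != dim for f in sub_image_features):
        raise ValueError("all sub-image feature vectors must have the same length")
    count = len(sub_image_features)
    return [sum(f[i] for f in sub_image_features) / count for i in range(dim)]


def build_predefined_features(sub_image_features: Sequence[Sequence[float]],
                              text_features: Sequence[float]) -> Vector:
    """Concatenate the fused image features with the predefined text features."""
    return fuse_sub_image_features(sub_image_features) + list(text_features)


fused = fuse_sub_image_features([[1.0, 3.0], [3.0, 5.0]])
predefined_feature = build_predefined_features([[1.0, 3.0], [3.0, 5.0]], [0.5])
```

Fusing first keeps the predefined feature length independent of how many images were registered for an object.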
[0034] Thirdly, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:
[0035] Extract real-time descriptions of the current object from videos that describe and display multiple objects separately;
[0036] Feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object;
[0037] Based on real-time description features, target description features predefined for the current display object are identified from the predefined description features for each display object in the video, and the display objects of the video include the current display object;
[0038] Based on predefined description content representing target description features, abnormal description identification is performed on real-time description content to obtain abnormal description identification results for the currently displayed object.
[0039] Fourthly, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, performs the following steps:
[0040] Extract real-time descriptions of the current object from videos that describe and display multiple objects separately;
[0041] Feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object;
[0042] Based on real-time description features, target description features predefined for the current display object are identified from the predefined description features for each display object in the video, and the display objects of the video include the current display object;
[0043] Based on predefined description content representing target description features, abnormal description identification is performed on real-time description content to obtain abnormal description identification results for the currently displayed object.
[0044] Fifthly, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:
[0045] Extract real-time descriptions of the current object from videos that describe and display multiple objects separately;
[0046] Feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object;
[0047] Based on real-time description features, target description features predefined for the current display object are identified from the predefined description features for each display object in the video, and the display objects of the video include the current display object;
[0048] Based on predefined description content representing target description features, abnormal description identification is performed on real-time description content to obtain abnormal description identification results for the currently displayed object.
[0049] The aforementioned information recognition method, apparatus, computer equipment, storage medium, and computer program product extract real-time description content for the current display object from videos that describe and display multiple display objects respectively; perform feature extraction on the real-time description content to obtain real-time description features for the current display object; based on the real-time description features, identify predefined target description features for the current display object from predefined description features for each display object in the video, where the display objects in the video include the current display object; and perform abnormal description identification on the real-time description content based on the predefined description content characterizing the target description features to obtain an abnormal description identification result for the current display object. By extracting real-time description features for the current display object through this information recognition method, the real-time description content during the description and display of the current display object is characterized by these features. The real-time description features are then compared with predefined description features to determine the predefined target description features for the current display object, i.e., locating the current display object and determining the predefined description content predefined and registered for it. Finally, the real-time description content is compared with the predefined description content to determine whether there is any description content in the real-time description content that does not conform to the predefined description content, thereby completing abnormal description identification and improving the efficiency and accuracy of information recognition.
Attached Figure Description
[0050] Figure 1 is an application environment diagram of the information recognition method in one embodiment;
[0051] Figure 2 is a flowchart illustrating the process of identifying false advertising information in one embodiment;
[0052] Figure 3 is a flowchart illustrating an information recognition method in one embodiment;
[0053] Figure 4 is a flowchart illustrating the process of determining target description features in one embodiment;
[0054] Figure 5 is a flowchart illustrating the process of determining the location of a target object in one embodiment;
[0055] Figure 6 is a flowchart illustrating the process of obtaining anomaly description identification results in one embodiment;
[0056] Figure 7 is a schematic diagram of the process for obtaining predefined descriptive features in one embodiment;
[0057] Figure 8 is a schematic diagram of the complete process of the information recognition method in one embodiment;
[0058] Figure 9 is a structural block diagram of an information recognition device in one embodiment;
[0059] Figure 10 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0060] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0061] With the development of e-commerce technology and the rise of e-commerce platforms, online consumption has become increasingly important in people's lives. Live-streaming e-commerce has become the mainstream of online consumption development, and the trend of dynamic product information display is becoming increasingly apparent. Because consumers cannot physically see the products during live streams and cannot determine whether the live stream involves false advertising, false advertising of product information frequently occurs in live-streaming e-commerce. Currently, keyword retrieval is usually applied to the product information to be identified, and a reviewer then manually compares the information promoted in the live stream with the actual product information to decide whether it contains false advertising. For example, when checking the brand information of a product, keywords such as "product brand" are used to filter out statements that may contain false advertising of the product's brand. These candidate statements are then sent to manual review, where the brand described in the promotional information is compared with the registered brand of the product, and any discrepancy is marked as false brand advertising. However, product information is diverse, including not only brand but also information such as origin and material. Furthermore, product descriptions during live streams tend to be conversational, which can lead to missed results when using keyword recall. Additionally, manual identification is susceptible to subjective judgment errors and processing inefficiencies. Therefore, improving the efficiency and accuracy of information identification is a pressing issue that needs to be addressed.
[0062] This application provides a method for efficient information recognition. The information recognition method provided in this application can be applied to, for example, the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. The data storage system can be integrated onto server 104, or it can be located in the cloud or on another server.
[0063] Specifically, taking server 104 as an example, server 104 extracts real-time description content for the current display object from videos that describe and display multiple display objects respectively. Then, it performs feature extraction on the real-time description content to obtain the real-time description features of the current display object. Based on the real-time description features, it identifies the predefined target description features for the current display object from the predefined description features for each display object in the video. The display objects in the video include the current display object. Finally, based on the predefined description content that represents the target description features, it performs abnormal description identification on the real-time description content to obtain the abnormal description identification result for the current display object.
[0064] The terminal 102 can be, but is not limited to, various desktop computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, and smart in-vehicle devices. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted devices. The server 104 can be implemented using a standalone server or a server cluster composed of multiple servers. The information recognition method provided in this application embodiment can be applied to various scenarios, including but not limited to scenarios requiring abnormal information recognition such as video information recognition, video product promotional information recognition, and live-stream product promotional information recognition. Abnormal information recognition can include false or illegal promotional information.
[0065] To facilitate understanding, take a scenario involving the identification of false advertising information as an example. As shown in the flowchart of Figure 2, the videos describing and displaying multiple objects are specifically live stream videos, and the objects being displayed are products, so the currently displayed object is the currently displayed product. On this basis, real-time description content 202 is extracted for the currently displayed product. First, the audio describing the product is extracted and converted into text to obtain the first real-time description text for the product. Second, text recognition can be performed on the subtitles displayed for the product to obtain the second real-time description text for the product; these subtitles can be comments posted in the live stream comment area or descriptions of the product on the live stream screen. Because the live stream video may also display images of the product, screenshots can be taken from the live stream video and used as the real-time display images of the product. Alternatively, frame extraction can be performed on the live stream video, and the resulting video frames can be used as the real-time display images of the product. The real-time description content obtained in this way includes at least the first real-time description text, the second real-time description text, and the real-time display image. Feature extraction is then performed on the real-time description content 202 to obtain the real-time description feature 204 of the currently displayed product.
[0066] Likewise, the other products to be displayed in the current live stream are display objects in this embodiment. Before the live stream, product images are stored for each product to be displayed, usually several images per product; these images are the predefined image of the product to be displayed. The system also stores related information such as the attribute information registered for the product to be displayed; this registered attribute information is the predefined text of the product to be displayed. Image features are extracted from each of the images included in the predefined image to obtain the predefined sub-image features of each image, and feature fusion processing is performed on the predefined sub-image features to obtain the predefined image features of the product to be displayed. Similarly, text feature extraction is performed on the predefined text to obtain the predefined text features of the product to be displayed. Feature concatenation processing is then performed on the predefined image features and the predefined text features to obtain the predefined descriptive features 206 of the product to be displayed.
[0067] Based on this, target prompt information representing the display number of the displayed product in the video is obtained. Then, based on real-time descriptive features, predefined descriptive features of each displayed product in the video, and the target prompt information, the position of the target product matching the currently displayed product is determined. That is, the position of the currently displayed product in the video is determined. Since the predefined descriptive features of each displayed product can be obtained through the aforementioned method, the predefined descriptive features of the displayed product corresponding to the target product position (i.e., the predefined descriptive features of the displayed product indicated by the position of the currently displayed product in the video) are determined as the predefined target descriptive feature 208 for the currently displayed product. For example, if the position of the currently displayed product in the video indicates the position of displayed product 1, then the predefined descriptive features of displayed product 1 are the predefined target descriptive features of the currently displayed product.
[0068] To determine whether the real-time description of the currently displayed product matches the predefined description, anomaly identification is performed on the real-time description content 202 represented by real-time description feature 204, based on the predefined description content representing target description feature 208. If the predefined description content matches the real-time description content, the anomaly identification result for the currently displayed object is determined to be "no anomaly," meaning the real-time description content conforms to the predefined description content and there is no false advertising. Conversely, if the predefined description content does not match the real-time description content, the anomaly identification result for the currently displayed object is determined to be "anomaly," meaning the real-time description content does not conform to the predefined description content and there is false advertising. In this case, the displayed product with anomaly identification results needs to undergo product review and other processing to help regulate the live-streaming e-commerce market and curb merchants' false advertising behavior. Specific processing methods are not detailed here.
[0069] The following embodiments illustrate this in detail. In one embodiment, as shown in Figure 3, an information recognition method is provided. The method is described here as applied to server 104 in Figure 1 by way of example; it can be understood that the method can also be applied to terminal 102, or to a system including terminal 102 and server 104, in which case it is implemented through the interaction between terminal 102 and server 104. In this embodiment, the method includes the following steps:
[0070] Step 302: Extract real-time description content for the current display object from the videos that describe and display multiple display objects respectively.
[0071] The video is used to describe and display multiple objects. Each object has a display number, which is only used to locate its position and not to specify the display order. Therefore, the description and display of multiple objects can be done sequentially according to their display numbers, or not. For example, there are objects A1, A2, A3, and A4, with display numbers B1 for A1, B2 for A2, B3 for A3, and B4 for A4. In the video, the objects are ordered as A1, A2, A3, and A4, but in the actual description and display process, this order does not necessarily apply. In a live-streaming e-commerce shopping scenario, display objects A1, A2, A3, and A4 are ordered according to their display numbers. For example, display object A1 is on link 1, display object A2 is on link 2, display object A3 is on link 3, and display object A4 is on link 4. The host can introduce display object A2, which is on link 2, first, and then introduce display object A1, which is on link 1.
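The point that a display number locates an object without fixing the order of introduction can be illustrated with a simple lookup table. The dictionary layout and the function name `objects_in_introduction_order` are assumptions of this sketch; the identifiers A1 to A4 and B1 to B4 come from the example above.

```python
from typing import Dict, List

# Display number -> display object, as in the example above.
display_objects: Dict[str, str] = {"B1": "A1", "B2": "A2", "B3": "A3", "B4": "A4"}


def objects_in_introduction_order(introduced_numbers: List[str],
                                  objects_by_number: Dict[str, str]) -> List[str]:
    """Resolve display numbers to objects in whatever order they are introduced."""
    return [objects_by_number[number] for number in introduced_numbers]


# The host introduces the object on link 2 first, then the object on link 1.
introduced = objects_in_introduction_order(["B2", "B1"], display_objects)
```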
[0072] The aforementioned video can be a live stream video obtained during the live description and display of the object, or a historical video obtained during the live description and display of the object within a historical period; there is no limitation here. In practical applications, the object to be displayed can be a physical object or a virtual object. Physical objects can be physical goods, such as electronic products, food, and clothing, while virtual objects can be virtual items, such as game items and virtual cards. Virtual objects can also be electronic goods such as electronic vouchers and electronic coupons; examples are not exhaustively listed here.
[0073] Secondly, when the video is a live stream, the currently displayed object is the object being displayed and described. When the video is a historical video, the currently displayed object can be the object whose information needs to be identified, or any object displayed in the historical video; there are no restrictions here. The real-time description content is the content describing and displaying the currently displayed object. This real-time description content can have multiple dimensions. Considering the high reliability of the visual information displayed in the video and the textual description information describing the currently displayed object, both textual and image descriptions are used to locate the current displayed object. Therefore, the real-time description content can be either text describing the current displayed object or image content displaying the current displayed object. Based on this, the real-time description content includes at least real-time description text and real-time display images. The real-time description text can be text content described through subtitles or text content converted from audio to text. Similarly, the real-time display image can be an image obtained by taking a screenshot of the video including the currently displayed object, or an image obtained by extracting frames from the video including the currently displayed object; there are no restrictions here.
[0074] Specifically, when information identification is required, the server first acquires videos describing and displaying multiple objects separately. In the case of live object description, the server acquires the live stream videos describing and displaying multiple objects generated during the live stream through communication and interaction with the terminal conducting the live stream. In the case of scenarios requiring review of historical object descriptions, the server can acquire historical videos describing and displaying multiple objects separately from locally stored historical videos, or it can acquire historical videos describing and displaying multiple objects separately from historical time periods through communication and interaction with the terminal that described and displayed multiple objects within a historical time period. Therefore, the method of acquiring the videos, and whether the videos are live stream videos or historical videos, is not limited here.
[0075] Furthermore, the server extracts real-time descriptive content for the currently displayed object from videos that describe and display multiple objects. When the video is a live stream, the currently displayed object is the one being described. Therefore, object identification is needed within the live stream video; that is, once the existence of a displayed and described object has been confirmed, descriptive content is extracted from the live stream video to obtain the real-time descriptive content for the currently displayed object. When the video is a historical video, the currently displayed object is already known to exist in the historical video. Therefore, descriptive content is directly extracted from the historical video to obtain the real-time descriptive content for the currently displayed object. As described above, since real-time descriptive content includes at least real-time descriptive text and real-time displayed images, the following describes in detail how descriptive content is extracted along the two dimensions of text extraction and image extraction:
[0076] In one optional embodiment, the object description method for the currently displayed object in the video includes language description and subtitle description; the real-time description content includes at least real-time description text.
[0077] Specifically, the language description refers to the spoken description of the currently displayed object. For example, in a live-streaming e-commerce scenario, the language description is the host's spoken description of the currently displayed product. If the currently displayed product is a gold cherry pendant, the spoken description could be, "This is Y-Fei's 18K gold cherry pendant, friends. It weighs 0.16 grams and is 50 centimeters in size." Secondly, the subtitle description is the way the currently displayed object is described through subtitles in the video. The subtitles obtained can be comments posted about the currently displayed product in the comment section of the video, or subtitles describing the currently displayed product. Therefore, real-time description content includes at least real-time description text. The following details the methods for obtaining real-time description text in real-time description content.
[0078] Based on this, extracting the real-time description content of the currently displayed object from the video in which multiple objects are described and displayed through language description includes: performing audio extraction on the video to obtain the audio that describes the currently displayed object in real time, and converting that audio into text to obtain a first real-time description text for the currently displayed object; performing text recognition on the subtitles displayed in the video that describe the currently displayed object to obtain a second real-time description text for the currently displayed object; and determining the first and second real-time description texts as the real-time description text.
[0079] Specifically, considering that the video includes audio, text, and image data, the server first extracts audio from the video that displays multiple objects through language descriptions. This extracts the audio describing the current object in real-time; that is, it extracts the audio describing the current object from the video's audio data. Then, the audio is converted to text to obtain the first real-time descriptive text for the current object. At this point, the first real-time descriptive text corresponds one-to-one with the audio describing the current object. In practical applications, the text conversion of the audio describing the current object is specifically performed using Automatic Speech Recognition (ASR) technology. That is, ASR converts the audio describing the current object into the first real-time descriptive text for the current object. For example, as described in the previous example, the first real-time descriptive text for the current object is: "This is a Y-Fei 18K gold cherry pendant, friends. It weighs 0.16 grams and is 50 centimeters in size."
[0080] Furthermore, the server performs text recognition on the subtitles displayed in the video that describe the currently displayed object, obtaining a second real-time descriptive text for the object. Since the subtitles can be either comments posted in the comment section of the video or subtitles describing the object, text recognition is needed to identify both. This results in the second real-time descriptive text, which includes at least one of the comment text or the subtitle text. In practical applications, text recognition of the subtitles is performed using Optical Character Recognition (OCR) technology. OCR converts the text of the subtitles into a black-and-white dot matrix image file, and then recognizes the text within the image file to obtain the second real-time descriptive text.
[0081] Based on this, the server determines the first real-time description text and the second real-time description text as the real-time description text. That is, the server concatenates the first and second real-time description texts obtained as described above to obtain the real-time description text of the currently displayed object. It can be understood that the description text includes various descriptive information that is not specific to the currently displayed object. For instance, in the example above, "This is Y Fei's 18K gold cherry pendant, friends. It weighs 0.16 grams and is 50 centimeters in size," the phrases "this is" and "friends" are colloquial expressions with low relevance to the currently displayed object. Therefore, to avoid the impact of redundant descriptive information on feature extraction, attribute information can be extracted from the real-time description text.
[0082] In other words, the server can determine predefined attribute types, then extract real-time attribute information from the first and second real-time description texts using these predefined attribute types, and finally determine the obtained real-time attribute information as the real-time description text in the actual application. The aforementioned predefined attribute types are the object attributes that need to be identified. For example, predefined attribute types could include: object name, object brand, object origin, object material, object weight, object model, and object specifications. Therefore, the first and second real-time description texts can extract information from multiple attribute types, such as "object name, object brand, object origin, object material, object weight, object model, and object specifications." For example, in the previous example, "This is a Y-Fei 18K gold cherry pendant, friends. It weighs 0.16 grams and is 50 centimeters in size," the obtained object name would be "gold cherry pendant," the object brand "Y-Fei," the object origin "unknown," the object material "gold," the object weight "0.16 grams," the object model "unknown," and the object specification "50 centimeters." In other words, when extracting real-time description content, attribute information can be extracted from the first real-time description text and the second real-time description text, and the real-time attribute information can be obtained from the obtained real-time description content.
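As a rough illustration of the attribute extraction described above, the following sketch pulls three of the predefined attribute types out of a description text with regular expressions. It is a rule-based stand-in, not the entity recognition model of this application; the name `extract_attributes`, the patterns, and the restriction to material, weight and specifications are assumptions made only for illustration.

```python
import re

# Hypothetical patterns for three of the predefined attribute types.
# A real system would use the fine-tuned entity recognition model instead.
ATTRIBUTE_PATTERNS = {
    "object material": re.compile(r"\b(gold|silver|platinum)\b", re.IGNORECASE),
    "object weight": re.compile(r"(\d+(?:\.\d+)?)\s*(?:grams?|g)\b", re.IGNORECASE),
    "object specifications": re.compile(
        r"(\d+(?:\.\d+)?)\s*(?:centimeters?|cm)\b", re.IGNORECASE
    ),
}

# Units appended to numeric matches so the values read like the source example.
UNITS = {"object weight": "grams", "object specifications": "centimeters"}


def extract_attributes(description_text: str) -> dict:
    """Return one value per attribute type, or "unknown" when nothing matches."""
    attributes = {}
    for attribute_type, pattern in ATTRIBUTE_PATTERNS.items():
        match = pattern.search(description_text)
        if match is None:
            attributes[attribute_type] = "unknown"
        elif attribute_type in UNITS:
            attributes[attribute_type] = f"{match.group(1)} {UNITS[attribute_type]}"
        else:
            attributes[attribute_type] = match.group(1).lower()
    return attributes


text = ("This is a Y-Fei 18K gold cherry pendant, friends. "
        "It weighs 0.16 grams and is 50 centimeters in size.")
real_time_attributes = extract_attributes(text)
```

Colloquial filler such as "this is" and "friends" never reaches the result, because only the matched attribute values are kept.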
[0083] The extraction of real-time attribute information can be achieved through an entity recognition model. Specifically, during the fine-tuning training of the entity recognition model, first and second descriptive text samples, along with attribute information samples extracted from them, are obtained. These samples are then converted into the data format required for fine-tuning the entity recognition model. For example, the data format for fine-tuning the entity recognition model might be: {"input": "This is a Y 18K gold cherry pendant, friends. It weighs 0.16 grams and is 50 centimeters in size\n### Extract the attribute information from the above sentence, including seven categories: object name, object brand, object origin, object material, object weight, object model, and object specifications:\n### Answer:\n", "output": "Object name: pendant\nObject brand: Y Fei\nObject origin: unknown\nObject material: gold\nObject weight: 0.16g\nObject model: unknown\nObject specifications: 50cm"}. Then, the entity recognition model is fine-tuned using the input and output data to obtain the fine-tuned entity recognition model.
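The conversion of an annotated sample into the input/output format shown above can be sketched as follows. The function name `build_finetune_sample` and the choice of one JSON record per line are assumptions; the prompt wording only loosely mirrors the example and is not a prescribed template.

```python
import json

ATTRIBUTE_TYPES = [
    "object name", "object brand", "object origin", "object material",
    "object weight", "object model", "object specifications",
]


def build_finetune_sample(description_text: str, attributes: dict) -> dict:
    """Format one (text, attributes) pair as an input/output training record."""
    instruction = (
        "### Extract the attribute information from the above sentence, "
        f"including {len(ATTRIBUTE_TYPES)} categories: "
        + ", ".join(ATTRIBUTE_TYPES) + ":"
    )
    prompt = f"{description_text}\n{instruction}\n### Answer:\n"
    # Attribute types missing from the annotation are labeled "unknown".
    answer = "\n".join(
        f"{name.capitalize()}: {attributes.get(name, 'unknown')}"
        for name in ATTRIBUTE_TYPES
    )
    return {"input": prompt, "output": answer}


sample = build_finetune_sample(
    "This is a Y-Fei 18K gold cherry pendant, friends. "
    "It weighs 0.16 grams and is 50 centimeters in size",
    {
        "object name": "pendant",
        "object brand": "Y Fei",
        "object material": "gold",
        "object weight": "0.16g",
        "object specifications": "50cm",
    },
)
# One JSON object per line is a common layout for fine-tuning data files.
line = json.dumps(sample, ensure_ascii=False)
```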
[0084] Secondly, in practical applications, considering that the text content of comments about the currently displayed product in the comment area of the video differs from the text obtained from audio conversion and OCR, different descriptive text types can be further subdivided. Attribute labeling can be applied to different descriptive texts to train entity recognition models corresponding to different descriptive texts. That is, during the real-time attribute information extraction process of real-time descriptive text, different entity recognition models can be used to extract real-time attribute information from real-time descriptive text obtained through different methods. Finally, all the obtained real-time attribute information is confirmed as the real-time attribute information of the currently displayed object. The method for training and fine-tuning the corresponding entity recognition models for real-time descriptive text obtained through different methods is similar to the previous example and will not be repeated here.
[0085] Since the real-time description content may also include real-time display images, the following describes how to extract real-time display images: In an optional embodiment, the real-time description content includes at least the real-time display image of the currently displayed object.
[0086] The real-time display image can be a screenshot of the video containing the currently displayed object, or it can be a frame extracted from the video containing the currently displayed object; there is no limitation here. The following details the methods for obtaining the real-time display image in the real-time description content.
[0087] Based on this, extracting the real-time description content of the currently displayed object from the video in which multiple objects are described and displayed through language description includes: taking screenshots of the video and determining the resulting video screenshots as real-time display images; or, performing frame extraction processing on the video and determining the resulting video frames as real-time display images.
[0088] Specifically, after the server acquires the video in which multiple objects are described and displayed through language description, it takes screenshots of the video frames containing the currently displayed object and determines the resulting screenshots as the real-time display image of the currently displayed object. Alternatively, it may first perform object recognition on the video to confirm the presence of the currently displayed object, and then take screenshots of the video frames containing that object. Or, considering that the video, whether a live stream video or a historical video, consists of multiple consecutive image frames, the server directly performs frame extraction on the video and determines the resulting video frames as the real-time display image of the currently displayed object. It is understood that the real-time display image of the currently displayed object can be a single image or multiple images. If multiple images are used, multiple screenshots can be taken of the video frames containing the currently displayed object, or frame extraction can be performed multiple times. In this case, however, it is necessary to ensure that the displayed object is the same in all of the resulting images.
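One simple way to realize the frame extraction mentioned above is to keep one frame per fixed time interval. The sketch below only selects the frame indices; decoding the frames is left to whatever video library is in use. The function name `sample_frame_indices` and the fixed-interval strategy are assumptions, not something this application prescribes.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_seconds: float) -> list:
    """Return the indices of the frames to keep, one every `interval_seconds`."""
    if total_frames <= 0 or fps <= 0 or interval_seconds <= 0:
        return []
    # Number of frames between two kept frames, never less than one.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 25 frames per second, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=250, fps=25.0, interval_seconds=2.0)
```

When several frames are kept, a further check that they all show the same displayed object is still required, as noted above.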
[0089] Step 304: Extract features from the real-time description content to obtain the real-time description features of the currently displayed object.
[0090] As described above, since real-time description content includes at least real-time descriptive text and real-time displayed images, the real-time description features of the currently displayed object must include at least the text features corresponding to the real-time descriptive text and the image features corresponding to the real-time displayed images. Specifically, the server extracts features from the real-time description content to obtain the real-time description features of the currently displayed object. That is, the real-time description content is converted into a feature vector, and the resulting feature vector is the real-time description feature of the currently displayed object. Because the methods of feature extraction for text and for images differ, a detailed explanation of each follows:
[0091] In one specific embodiment, based on the real-time description content including at least real-time description text and real-time displayed images, feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object. This includes: extracting text features from the real-time description text to obtain the text features of the currently displayed object; extracting image features from the real-time displayed images to obtain the image features of the currently displayed object; and performing feature concatenation processing on the text features and image features to obtain the real-time description features of the currently displayed object.
[0092] Specifically, the server extracts text features from the real-time description text within the real-time description content to obtain the text features of the currently displayed object. This involves tokenizing and embedding the real-time description text to obtain a description text vector, which represents the text features of the currently displayed object. As described above, the server can determine predefined attribute types and then extract real-time attribute information from the first and second real-time description texts using these predefined attribute types. The obtained real-time attribute information is then used as the real-time description text in the actual application; that is, the real-time description text is specifically real-time attribute information. Since real-time attribute information is also text information, the server performs tokenization and embedding on this information to obtain a real-time attribute feature vector. This real-time attribute feature vector represents the text features of the currently displayed object, meaning that the text features are primarily used to characterize the real-time attribute information of the currently displayed object.
[0093] Furthermore, the server extracts image features from the real-time displayed images in the real-time description content to obtain the image features of the currently displayed object. That is, the server uses an image encoder to extract an image feature vector from the real-time displayed image, and then uses a feature mapping module to map the image feature vector to the image features of the currently displayed object. Considering that the real-time displayed image is a single image, feature fusion using a cross-attention module is unnecessary. The aforementioned image encoder can be a pre-trained model, such as the Contrastive Language-Image Pre-Training (CLIP) model or the Bootstrapping Language-Image Pre-training (BLIP) model; no specific limitation is made here.
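The feature concatenation of step 304 can be sketched with toy vectors. The hashed bag-of-tokens embedding below merely stands in for the tokenization and embedding step, and the image feature values are placeholders for the output of an image encoder and feature mapping module; `embed_text`, `concatenate_features` and the dimension of 8 are all assumptions for illustration.

```python
import hashlib


def embed_text(text: str, dim: int = 8) -> list:
    """Toy embedding: hash each whitespace token into one of `dim` buckets."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # hashlib is used because the built-in hash() is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    return vector


def concatenate_features(text_features: list, image_features: list) -> list:
    """Join text and image feature vectors into one description feature."""
    return list(text_features) + list(image_features)


text_features = embed_text("gold cherry pendant 0.16 grams")
image_features = [0.2, 0.7, 0.1, 0.0]  # placeholder, not a real encoder output
real_time_features = concatenate_features(text_features, image_features)
```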
[0094] Step 306: Based on real-time description features, identify target description features predefined for the current display object from the predefined description features for each display object of the video, wherein the display objects of the video include the current display object.
[0095] The target description feature is used to characterize the predefined description content for the current display object. Considering that the description text contains attribute information, the target description feature at least characterizes the predefined attribute information for the current display object. For example, for a pendant of a certain brand, the predefined attribute information of the pendant is: object name: Golden Cherry Pendant, object brand: Y Fei, object origin: Country A, object material: gold, object weight: 0.15 grams, object model: ABC, object specification: 50 cm. Since the description content usually also includes display images, the target description feature can also characterize the predefined image for the current display object, and the predefined image can be composed of multiple images of the current display object.
[0096] Secondly, the display objects of the video include the current display object; that is, the current display object must be one of the display objects that can be displayed in the video. For example, the video includes display objects A1, A2, A3, and A4, and the current display object can be display object A2 or any other display object shown in the video. Based on this, since the attribute information must be registered and the object images obtained for each display object before that object is displayed and described, the predefined description feature of each display object in the video specifically characterizes the predefined description content of the display object, that is, the description content registered in advance. As introduced above, the description content can exist in two dimensions: display images and description text. The predefined description feature of a display object can therefore characterize at least the display image of the display object and the attribute information in the description text of the display object.
[0097] Specifically, based on real-time description features, the server identifies predefined target description features for the current display object from the predefined description features for each display object in the video. That is, the server performs feature concatenation between the real-time description features of the current display object and the predefined description features of each display object in the video, and inputs the concatenated features into the Large Language Model (LLM). The LLM determines the target object position of the current display object in the video, and then determines the predefined description features corresponding to the target object position of the current display object in the video as the predefined target description features for the current display object. For example, the video includes display objects A1, A2, A3, and A4, with display numbers B1 for A1, B2 for A2, B3 for A3, and B4 for A4. The location indicated by display number B1 corresponds to a predefined descriptive feature C1, the location indicated by display number B2 corresponds to a predefined descriptive feature C2, the location indicated by display number B3 corresponds to a predefined descriptive feature C3, and the location indicated by display number B4 corresponds to a predefined descriptive feature C4. If the current display object is determined to be display object A2 matched by display number B2, then the location indicated by display number B2 corresponds to predefined descriptive feature C2. In this case, predefined descriptive feature C2 can be determined as the predefined target descriptive feature of the current display object.
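The mapping in the example above, from display number to the registered description feature, amounts to a table lookup once the display number is known. In the sketch below the dictionary `PREDEFINED_FEATURES` and the function `select_target_feature` are hypothetical, and the strings "C1" to "C4" are placeholders for real feature vectors.

```python
# Registered (predefined) description features, keyed by display number.
PREDEFINED_FEATURES = {
    "B1": {"object": "A1", "feature": "C1"},
    "B2": {"object": "A2", "feature": "C2"},
    "B3": {"object": "A3", "feature": "C3"},
    "B4": {"object": "A4", "feature": "C4"},
}


def select_target_feature(display_number: str) -> str:
    """Return the predefined feature registered under `display_number`."""
    entry = PREDEFINED_FEATURES.get(display_number)
    if entry is None:
        raise KeyError(f"no display object registered under {display_number!r}")
    return entry["feature"]


# The current display object was matched to display number B2.
target_feature = select_target_feature("B2")
```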
[0098] Step 308: Based on the predefined description content representing the target description features, perform anomaly description identification on the real-time description content to obtain the anomaly description identification result for the currently displayed object.
[0099] The anomaly description identification results can be: no anomaly description, anomaly description present, or unknown attribute information. Specifically, the server performs anomaly description identification on the real-time description content based on predefined description content that characterizes the target description features, obtaining the anomaly description identification result for the currently displayed object. That is, the server first determines the predefined description content characterized by the target description features. As mentioned earlier, the predefined description content includes at least a predefined image and predefined text of the displayed object; during the description content matching process, the predefined text within the predefined description content is specifically matched. Similarly, the real-time description content includes at least a real-time display image and real-time description text of the displayed object; during the description content matching process, the real-time description text within the real-time description content is specifically matched. In other words, the server then performs text matching on the predefined text within the predefined description content and the real-time description text within the real-time description content to perform anomaly description identification, thereby obtaining the anomaly description identification result for the currently displayed object.
[0100] As can be seen from the foregoing, when processing descriptive text, entity recognition can be performed to identify the attribute information in the descriptive text. That is, predefined text can be predefined attribute information, while real-time descriptive text can be real-time attribute information. At this time, attribute information matching between predefined attribute information and real-time attribute information can also complete the abnormal description recognition, thereby obtaining the abnormal description recognition result for the currently displayed object.
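Attribute matching between predefined and real-time attribute information can be sketched as a per-attribute comparison. The exact-match rule after lowercasing, the function name `identify_anomaly`, and the way the per-attribute outcomes are combined into one overall result are assumptions; the three result labels are the ones named above. The sample values follow the pendant example, with units written the same way on both sides for simplicity.

```python
NO_ANOMALY = "no anomaly description"
ANOMALY = "anomaly description present"
UNKNOWN = "unknown attribute information"


def identify_anomaly(predefined: dict, real_time: dict) -> dict:
    """Compare real-time attribute values against the registered ones."""
    per_attribute = {}
    for attribute_type, registered_value in predefined.items():
        spoken_value = real_time.get(attribute_type, "unknown")
        if spoken_value == "unknown":
            per_attribute[attribute_type] = UNKNOWN
        elif spoken_value.strip().lower() == registered_value.strip().lower():
            per_attribute[attribute_type] = NO_ANOMALY
        else:
            per_attribute[attribute_type] = ANOMALY
    # Assumed priority: any mismatch outweighs any unknown attribute.
    outcomes = set(per_attribute.values())
    if ANOMALY in outcomes:
        overall = ANOMALY
    elif UNKNOWN in outcomes:
        overall = UNKNOWN
    else:
        overall = NO_ANOMALY
    return {"overall": overall, "per_attribute": per_attribute}


predefined = {
    "object brand": "Y Fei", "object origin": "Country A",
    "object material": "gold", "object weight": "0.15 grams",
    "object specifications": "50 cm",
}
real_time = {
    "object brand": "Y Fei", "object origin": "unknown",
    "object material": "gold", "object weight": "0.16 grams",
    "object specifications": "50 cm",
}
result = identify_anomaly(predefined, real_time)
```

Here the spoken weight of 0.16 grams differs from the registered 0.15 grams, so the weight attribute is flagged, while the origin is reported as unknown because it was never mentioned.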
[0101] It is understood that the corresponding examples in the embodiments of this application are used to understand this solution, but should not be construed as specific limitations on this solution.
[0102] In the aforementioned information recognition method, real-time descriptive features are extracted for the currently displayed object. These features represent the real-time descriptive content during the description and display of the object. By comparing the real-time descriptive features with predefined descriptive features, the predefined target descriptive features of the object are determined. This means locating the object and identifying the predefined descriptive content it is registered with. Furthermore, by comparing the real-time descriptive content with the predefined descriptive content, it is determined whether there is any descriptive content in the real-time descriptive content that does not conform to the predefined descriptive content. This process completes the identification of abnormal descriptions, thereby improving the efficiency and accuracy of information recognition.
[0103] In one embodiment, as shown in Figure 4, the information recognition method further includes:
[0104] Step 402: Obtain target prompt information representing the display number of the displayed object in the video.
[0105] The target prompt information (Prompt) represents the display number of the displayed object in the video, and is specifically used to cause the large language model to output the display number of the current displayed object in the video. Secondly, the display number is used to locate the object's position, not to limit the display order of the objects. Therefore, multiple displayed objects can be described and displayed sequentially according to their display numbers, or they can be described and displayed independently of their display numbers.
[0106] Specifically, when it is necessary to locate the position of the currently displayed object, considering that the display number is used to locate the position of each corresponding display object, the server obtains the target prompt information representing the display number of the displayed object in the video. For example, the target prompt information could be "What is the display number of the currently displayed object in the video?". If applied to e-commerce live streaming, the target prompt information could be "Which product in the live stream is the currently displayed product?". The descriptive text of the target prompt information is not limited here.
[0107] Based on this, step 306, identifying, based on the real-time description features, the target description features predefined for the current display object from the predefined description features of each display object in the video, includes:
[0108] Step 404: Based on real-time description features, predefined description features of each displayed object in the video, and target prompt information, determine the location of the target object that matches the currently displayed object.
[0109] Specifically, the server determines the location of the target object matching the currently displayed object based on real-time description features, predefined description features of each displayed object in the video, and target prompt information. That is, the server first performs feature processing on the target prompt information to obtain target prompt features, then concatenates the real-time description features, the predefined description features of each displayed object in the video, and the target prompt features. The concatenated features are then input into the large language model (LLM), which determines the location of the target object matching the currently displayed object. Furthermore, since the target prompt information is specifically in text form, the server can perform tokenization and embedding processing on the target prompt information to obtain a target prompt feature vector, which is the target prompt feature.
[0110] To facilitate understanding of the aforementioned process, the flowchart shown in Figure 5 illustrates the process of determining the location of the target object. For each displayed object, the predefined text of the displayed object is tokenized and embedded to obtain its predefined text features. Additionally, an image encoder extracts image feature vectors from the predefined image of the displayed object. Considering that the predefined image is usually composed of multiple images, a cross-attention module is needed to fuse the image feature vectors from these multiple images to obtain the predefined image features of the displayed object. Thus, for multiple displayed objects, multiple combinations of predefined text features and predefined image features can be obtained.
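The fusion of several image feature vectors into one can be illustrated with a single-query scaled dot-product attention. This is a deliberately simplified stand-in: an actual cross-attention module has learned projection matrices and usually several heads, all omitted here. The function name `attention_fuse` and the externally supplied `query` vector are assumptions.

```python
import math


def attention_fuse(image_vectors: list, query: list) -> list:
    """Fuse image feature vectors by attention-weighted averaging."""
    if not image_vectors:
        raise ValueError("at least one image feature vector is required")
    dim = len(query)
    # Scaled dot-product score between the query and each image vector.
    scores = [
        sum(q * k for q, k in zip(query, vector)) / math.sqrt(dim)
        for vector in image_vectors
    ]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exp_scores = [math.exp(score - max_score) for score in scores]
    total = sum(exp_scores)
    weights = [value / total for value in exp_scores]
    # Weighted sum of the image vectors, one output value per dimension.
    return [
        sum(weight * vector[i] for weight, vector in zip(weights, image_vectors))
        for i in range(dim)
    ]


# A zero query scores every image equally, so the result is a plain average.
uniform = attention_fuse([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
# A query aligned with the first image weights that image more heavily.
focused = attention_fuse([[1.0, 0.0], [0.0, 1.0]], query=[10.0, 0.0])
```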
[0111] Similarly, the real-time description text of the displayed object is tokenized and embedded to obtain its text features. An image encoder extracts the real-time display image of the object into an image feature vector. Since the real-time display image is a single image, feature fusion using a cross-attention module is unnecessary. The image encoder extracts the image feature vector, which is then mapped to the image features of the currently displayed object using a feature mapping module. Finally, the target prompt information is tokenized and embedded to obtain its target prompt features.
[0112] Therefore, by inputting multiple sets of predefined text features, predefined image features, text features and image features of the currently displayed object, and target prompt features into the large language model, the large language model outputs the location of the target object that matches the currently displayed object. For example, taking e-commerce live streaming as an example, if the target prompt is "Which product is the currently displayed product in the live stream?", and the currently displayed object can be displayed object A2, then the output target object location can be "The currently displayed product is the 2nd displayed product", or simply "The 2nd".
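Because the model may answer either with a full sentence or with a short form such as "The 2nd", the numeric position has to be read out of free text. A minimal sketch, assuming the answer contains an English ordinal written with digits; the function name `parse_display_position` is hypothetical.

```python
import re

# Matches ordinals written with digits, such as "1st", "2nd", "3rd", "14th".
ORDINAL_PATTERN = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b", re.IGNORECASE)


def parse_display_position(answer: str):
    """Return the 1-based position named in the model's answer, or None."""
    match = ORDINAL_PATTERN.search(answer)
    return int(match.group(1)) if match else None


long_form = parse_display_position(
    "The currently displayed product is the 2nd displayed product"
)
short_form = parse_display_position("The 2nd")
```

Returning `None` for an answer without an ordinal lets the caller decide how to handle an unusable model output.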
[0113] Step 406: Determine the predefined descriptive features of the displayed object corresponding to the target object location as the predefined target descriptive features for the current displayed object.
[0114] Specifically, the server determines the predefined descriptive features of the displayed object corresponding to the target object's location as the predefined target descriptive features for the current displayed object. Since attribute information registration and object image acquisition are required for each displayed object before information recognition, the predefined descriptive features of each displayed object in the video specifically characterize the predefined descriptive content of the displayed object, that is, the pre-registered descriptive content. Furthermore, each displayed object has a predefined location in the video, meaning each displayed object has a corresponding display number, which indicates that the displayed object exists in a predefined location in the video. Considering that displayed objects have corresponding predefined descriptive features, and that the display number can essentially locate the displayed object at a predefined location, the server can determine the corresponding displayed object through the target object's location and determine the predefined descriptive features of that displayed object. Then, it determines the predefined descriptive features as the predefined target descriptive features for the current displayed object.
[0115] For ease of understanding, the video includes display objects A1, A2, A3, and A4. Display object A1 has a display number B1, display object A2 has a display number B2, display object A3 has a display number B3, and display object A4 has a display number B4. That is, display number B1 indicates that display object A1 is in the first position in the video, display number B2 indicates that display object A2 is in the second position, display number B3 indicates that display object A3 is in the third position, and display number B4 indicates that display object A4 is in the fourth position. If the target object is determined to be in the second position, meaning display number B2 corresponds to display object A2, and the predefined descriptive feature of display object A2 is specifically predefined descriptive feature C2, then it can be determined that predefined descriptive feature C2 is a predefined target descriptive feature for the current display object.
[0116] It is understood that the corresponding examples in the embodiments of this application are used to understand this solution, but should not be construed as specific limitations on this solution.
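As an illustration only, the lookup in the example above can be sketched as a small table keyed by display number. The dictionary layout and the function names are assumptions made for this sketch; A1 to A4, B1 to B4 and C2 come from the example, while C1, C3 and C4 are placeholder names.

```python
# Hypothetical registry keyed by display number. C1, C3 and C4 are placeholders;
# only C2 is named in the example text.
REGISTRY = {
    "B1": {"display_object": "A1", "predefined_feature": "C1"},
    "B2": {"display_object": "A2", "predefined_feature": "C2"},
    "B3": {"display_object": "A3", "predefined_feature": "C3"},
    "B4": {"display_object": "A4", "predefined_feature": "C4"},
}


def display_number_at(position):
    """Return the display number for a 1-based position in the video."""
    return "B{}".format(position)


def target_descriptive_feature(position):
    """Look up the display object and predefined feature at the target position."""
    entry = REGISTRY.get(display_number_at(position))
    if entry is None:
        raise KeyError("no display object registered at position {}".format(position))
    return entry["display_object"], entry["predefined_feature"]


# Target object location: second position.
display_object, feature = target_descriptive_feature(2)
print(display_object, feature)
```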
[0117] In this embodiment, the target prompt information can be used to locate the currently displayed object in the video. That is, by determining the position of the target object, the position of the currently displayed object in the video corresponding to the registered display object can be accurately located, thereby providing accurate descriptive features for subsequent information recognition. This ensures that the predefined descriptive content represented by the descriptive features is the actual registration information describing the currently displayed object, further improving the reliability and accuracy of information recognition.
[0118] In one embodiment, as shown in Figure 6, performing anomaly description identification on the real-time description content, based on the predefined description content represented by the target description features, to obtain the anomaly description identification result for the currently displayed object includes:
[0119] Step 602: Determine the predefined descriptive content used to characterize the target descriptive features.
[0120] Specifically, the server determines the predefined descriptive content used to characterize the target descriptive features. As explained above, since the target descriptive features are specifically predefined descriptive features of the currently displayed object, and these predefined descriptive features are obtained by feature processing of predefined images and predefined text, the predefined images and predefined text represented by the current displayed object can be determined through the predefined descriptive features of the current displayed object. Furthermore, as explained above, considering that the descriptive text contains attribute information, meaning the target descriptive features at least characterize the predefined attribute information predefined for the current displayed object, the predefined attribute information represented by the predefined descriptive features can also be directly determined at this point.
[0121] It is understood that, in the aforementioned embodiments, the location of the target object matching the current display object is determined based on the real-time description features, the predefined description features of each displayed object in the video, and the target prompt information. Therefore, once the target object location is determined, the predefined description content of the corresponding display object can also be determined from that location. For example, if the target object location is the second position, display number B2 for the second position corresponds to display object A2, so the predefined description content of display object A2 is the predefined description content of the current display object.
[0122] Step 604: Extract predefined attribute information from the predefined description content and extract real-time attribute information from the real-time description content.
[0123] Specifically, the server extracts predefined attribute information from the predefined description content. As described above, the predefined description content includes predefined images and predefined text. If the predefined text contains more than the predefined attribute information, entity recognition, as described in the previous embodiments, needs to be performed on the predefined text to extract the predefined attribute information. If the predefined text is already the text obtained after that entity recognition, the predefined text itself is the predefined attribute information and can be used directly. The predefined attribute information in the predefined text characterizes the attributes of the displayed object.
[0124] Similarly, the server extracts real-time attribute information from the real-time description content. As described above, the real-time description content includes real-time displayed images and real-time descriptive text. If the real-time descriptive text contains more than real-time attribute information, entity recognition, as described in the previous embodiments, needs to be performed on the real-time descriptive text to extract the real-time attribute information. If the real-time descriptive text is already the text obtained after that entity recognition, the real-time descriptive text itself is the real-time attribute information and can be used directly. The real-time attribute information in the real-time descriptive text characterizes the attributes of the displayed object as described in real time. Furthermore, as described above, real-time attribute information can also be extracted from the first and second real-time descriptive texts according to predefined attribute types; that is, both the real-time attribute information and the predefined attribute information are extracted based on the predefined attribute types.
[0125] In short, if the real-time description text is already the text obtained after entity recognition as described in the preceding embodiments, it is essentially real-time attribute information and entity recognition is not required again; if it contains more than real-time attribute information, entity recognition is necessary. This case is described below. In an optional embodiment, the information recognition method further includes: obtaining predefined attribute types that describe objects from multiple description dimensions. The predefined attribute types are the object attributes to be identified. For example, the predefined attribute types include object name, object brand, object origin, object material, object weight, object model, and object specification.
[0126] Specifically, the server obtains predefined attribute types that describe objects from multiple descriptive dimensions. The server can obtain predefined attribute types through locally stored data information, or through a communication connection with a terminal used to select predefined attribute types.
[0127] Based on this, predefined attribute information is extracted from predefined description content, and real-time attribute information is extracted from real-time description content, including: extracting attributes from predefined description content according to multiple predefined attribute types to obtain predefined attribute information; the predefined attribute information includes attribute information of multiple predefined attribute types; extracting attributes from real-time description content according to multiple predefined attribute types to obtain real-time attribute information; the real-time attribute information includes attribute information of multiple predefined attribute types.
[0128] Specifically, the server extracts attributes from the predefined description content according to the multiple predefined attribute types to obtain predefined attribute information, so the obtained predefined attribute information includes attribute information of multiple predefined attribute types. This covers the case where the predefined text contains more than predefined attribute information: attributes are extracted from the predefined description content according to the multiple predefined attribute types, that is, entity recognition is performed to extract the attribute information belonging to the multiple predefined attribute types from the predefined description content. As in the example above, for a pendant of a certain brand, the predefined attribute information of the pendant is: Object Name: Golden Cherry Pendant, Object Brand: Y Fei, Object Origin: Country A, Object Material: Gold, Object Weight: 0.15 grams, Object Model: ABC, Object Specification: 50 centimeters.
[0129] Similarly, the server extracts attributes from the real-time description content according to the multiple predefined attribute types to obtain real-time attribute information, so the obtained real-time attribute information includes attribute information of multiple predefined attribute types. This covers the case where the real-time description text contains more than real-time attribute information: attributes are extracted from the real-time description content according to the multiple predefined attribute types, that is, entity recognition is performed to extract the attribute information belonging to the multiple predefined attribute types from the real-time description content. As in the previous example, for a pendant of a certain brand, the real-time attribute information of the pendant is: Object Name: Golden Cherry Pendant, Object Brand: Y Fei, Object Origin: Unknown, Object Material: Gold, Object Weight: 0.16 grams, Object Model: Unknown, Object Specification: 50 centimeters.
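The following minimal sketch shows one way to organize extracted attributes by predefined attribute type. It assumes a separate entity recognizer has already produced (attribute type, value) pairs; that recognizer, the function name `to_attribute_info`, and the use of `None` for an attribute that was never described are all assumptions of this sketch, not part of the described method.

```python
# Predefined attribute types taken from the example list in the text.
PREDEFINED_ATTRIBUTE_TYPES = [
    "object name", "object brand", "object origin", "object material",
    "object weight", "object model", "object specification",
]


def to_attribute_info(entities):
    """Arrange recognized (attribute_type, value) pairs by predefined type.

    `entities` stands in for the output of a hypothetical entity recognizer.
    Types that were not recognized keep the value None (a null value).
    """
    info = {attr_type: None for attr_type in PREDEFINED_ATTRIBUTE_TYPES}
    for attr_type, value in entities:
        key = attr_type.strip().lower()
        # Ignore entities outside the predefined types and empty values.
        if key in info and value and value.strip():
            info[key] = value.strip()
    return info


# Real-time description of the pendant: origin and model were not mentioned.
realtime_entities = [
    ("Object Name", "Golden Cherry Pendant"),
    ("Object Brand", "Y Fei"),
    ("Object Material", "Gold"),
    ("Object Weight", "0.16 grams"),
    ("Object Specification", "50 centimeters"),
]
realtime_info = to_attribute_info(realtime_entities)
print(realtime_info)
```

Filling every predefined type up front keeps the predefined and real-time records aligned, so later comparison can proceed type by type.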
[0130] Step 606: Based on predefined attribute information and real-time attribute information, perform anomaly description identification on the real-time attribute information to obtain the anomaly description identification result for the currently displayed object.
[0131] Specifically, the server performs anomaly description identification on the real-time attribute information based on predefined attribute information and real-time attribute information, obtaining anomaly description identification results for the currently displayed object. In other words, the server matches the predefined attribute information with the real-time attribute information to determine if they are consistent. As mentioned earlier, the attribute information includes attribute information belonging to multiple predefined attribute types; that is, attribute information under each predefined attribute type can be matched accordingly. Therefore, the obtained anomaly description identification results can include attribute anomaly description results corresponding to attribute information under predefined attribute types. The following sections describe different scenarios:
[0132] In one specific embodiment, based on predefined attribute information and real-time attribute information, anomaly description identification is performed on the real-time attribute information to obtain anomaly description identification results for the currently displayed object, including: if the predefined attribute information and the real-time attribute information are consistent, the anomaly description identification result for the currently displayed object is determined to be no anomaly description; if the predefined attribute information and the real-time attribute information are inconsistent, the anomaly description identification result for the currently displayed object is determined to be an anomaly description.
[0133] Specifically, as can be seen from the aforementioned embodiments, the server performs attribute information matching on the predefined attribute information and the real-time attribute information to determine whether the predefined attribute information and the real-time attribute information are consistent. If the predefined attribute information and the real-time attribute information are consistent, it means that the attribute information of multiple predefined attribute types in the predefined attribute information is consistent with the attribute information of multiple predefined attribute types in the real-time attribute information. At this time, it means that the real-time description of the currently displayed object is consistent with the actual situation. Therefore, the abnormal description identification result of the currently displayed object is determined to be no abnormal description.
[0134] Similarly, the server performs attribute information matching between predefined attribute information and real-time attribute information to determine if they are consistent. If they are inconsistent, it means that at least one of the attribute information for multiple predefined attribute types in the predefined attribute information is inconsistent with the attribute information for multiple predefined attribute types in the real-time attribute information. This indicates that the real-time description of the currently displayed object is partially inconsistent with reality, thus confirming that an abnormal description exists for the currently displayed object. Furthermore, because of the inconsistency between predefined and real-time attribute information, the server can further identify the inconsistent predefined attribute types and classify them as abnormal attribute information. In other words, the real-time description of these abnormal attribute types is inconsistent with reality.
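A minimal sketch of this consistent/inconsistent decision follows. It assumes attribute information is held as a dictionary from predefined attribute type to a text value and that consistency means exact string equality; both are simplifications chosen for illustration, and the function name `identify_anomaly` is hypothetical.

```python
def identify_anomaly(predefined, realtime):
    """Compare attribute information type by type.

    Returns the overall result and the predefined attribute types whose
    real-time value differs from the registered value.
    """
    abnormal_types = [
        attr_type
        for attr_type, expected in predefined.items()
        if realtime.get(attr_type) != expected
    ]
    if abnormal_types:
        return "anomaly description exists", abnormal_types
    return "no anomaly description", abnormal_types


predefined = {
    "object name": "Golden Cherry Pendant",
    "object material": "Gold",
    "object weight": "0.15 grams",
}
realtime = {
    "object name": "Golden Cherry Pendant",
    "object material": "Gold",
    "object weight": "0.16 grams",
}
result, abnormal = identify_anomaly(predefined, realtime)
print(result, abnormal)
```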
[0135] Furthermore, in practical applications, some predefined attribute types may not be mentioned during the real-time description, so entity recognition yields an empty value for the corresponding attribute information. Likewise, some predefined attribute types may not have been filled in during registration, which also leaves the corresponding attribute information empty. This situation is described below:
[0136] In one alternative embodiment, the anomaly description identification result may also include unknown attribute information.
[0137] On this basis, performing anomaly description identification on the real-time attribute information, based on the predefined attribute information and the real-time attribute information, to obtain the anomaly description identification result for the currently displayed object includes: identifying attribute information of predefined attribute types with null values in the predefined attribute information as first unknown attribute information; identifying attribute information of predefined attribute types with null values in the real-time attribute information as second unknown attribute information; and, when either the first or the second unknown attribute information exists, identifying the attribute information of the matching predefined attribute type as unknown attribute information.
[0138] The first unknown attribute information refers to attribute information of a predefined attribute type whose attribute information is null in the predefined attribute information, and the second unknown attribute information refers to attribute information of a predefined attribute type whose attribute information is null in the real-time attribute information. Specifically, the server performs a null value query on the attribute information of the predefined attribute type included in the predefined attribute information. A null attribute information is defined as either not having been assigned a value or being in an unknown state. Therefore, the attribute information of the predefined attribute type that is not assigned a value or is in an unknown state is identified as the first unknown attribute information.
[0139] Similarly, the server performs a null value query on the attribute information of the predefined attribute types included in the real-time attribute information. Null attribute information means that the attribute information has not been assigned a value or is in an unknown state. Therefore, attribute information of a predefined attribute type that has not been assigned a value or is in an unknown state in the real-time attribute information is identified as the second unknown attribute information. For example, for a pendant of a certain brand, the pendant's real-time attribute information is: Object Name: Golden Cherry Pendant, Object Brand: Y Fei, Object Origin: Unknown, Object Material: Gold, Object Weight: 0.16 grams, Object Model: Unknown, Object Specification: 50 centimeters. In this case, the object origin and object model are both unknown (i.e., null values) in the real-time attribute information. Therefore, the object origin and object model can be identified as the second unknown attribute information.
[0140] Further, when the first unknown attribute information exists, the second unknown attribute information exists, or both exist, it is determined that some attribute information has not been assigned a value. The server therefore determines the attribute information of the matching predefined attribute type as unknown attribute information, and the unknown attribute information can carry that predefined attribute type. Continuing the previous example, the object origin and the object model are identified as the second unknown attribute information, so the object origin and the object model are determined to be unknown attribute information.
[0141] The anomaly description identification result can therefore take three forms: no anomaly description, an anomaly description exists, and unknown attribute information. Different attribute anomaly description results can thus be output for different predefined attribute types. Each attribute anomaly description result has a matching predefined attribute type and characterizes the anomaly identification for that type between the predefined attribute information and the real-time attribute information. That is, the anomaly description identification result can include the attribute anomaly description result under each predefined attribute type. For ease of understanding, suppose the obtained predefined attribute information is: object name: Golden Cherry Pendant, object brand: Y Fei, object origin: Country A, object material: gold, object weight: 0.15 grams, object model: ABC, object specification: 50 cm. And the obtained real-time attribute information is: object name: Golden Cherry Pendant, object brand: Y Fei, object origin: unknown, object material: gold, object weight: 0.16 grams, object model: unknown, object specification: 50 cm.
[0142] At this time, attribute information matching is performed separately for each predefined attribute type; that is, the predefined attribute information and the real-time attribute information are matched by object name, object brand, object origin, object material, object weight, object model, and object specification. For the object name, both the predefined attribute information and the real-time attribute information are described as "Golden Cherry Pendant". Therefore, the predefined attribute information and the real-time attribute information are consistent under the object name, and the attribute anomaly description result for the object name is no anomaly description. Similarly, for the object brand, both the predefined attribute information and the real-time attribute information are described as "Y Fei". Therefore, the predefined attribute information and the real-time attribute information are consistent under the object brand, and the attribute anomaly description result for the object brand is no anomaly description.
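The null-value query can be sketched as follows. Treating `None`, an empty string and the literal text "unknown" as null values is an assumption of this sketch, as are the function names.

```python
def is_null(value):
    """A value is null if it was never assigned or is marked as unknown."""
    return value is None or value.strip().lower() in ("", "unknown")


def unknown_attribute_types(predefined, realtime, attribute_types):
    """Return (first unknown, second unknown, combined unknown) attribute types."""
    first_unknown = [t for t in attribute_types if is_null(predefined.get(t))]
    second_unknown = [t for t in attribute_types if is_null(realtime.get(t))]
    # A type is unknown when it is null on either side.
    unknown = [t for t in attribute_types
               if t in first_unknown or t in second_unknown]
    return first_unknown, second_unknown, unknown


types = ["object name", "object brand", "object origin", "object material",
         "object weight", "object model", "object specification"]
predefined = dict(zip(types, ["Golden Cherry Pendant", "Y Fei", "Country A",
                              "Gold", "0.15 grams", "ABC", "50 centimeters"]))
realtime = dict(zip(types, ["Golden Cherry Pendant", "Y Fei", "Unknown",
                            "Gold", "0.16 grams", "Unknown", "50 centimeters"]))

first, second, unknown = unknown_attribute_types(predefined, realtime, types)
print(first, second, unknown)
```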
[0143] Similarly, regarding the object's origin, the predefined attribute information describes it as "Country A," while the real-time attribute information describes it as "Unknown." This means there is a second unknown attribute information under the object's origin, therefore the attribute anomaly description result for the object's origin is "Unknown attribute information." Regarding the object's material, both the predefined and real-time attribute information are described as "Gold." Therefore, the predefined and real-time attribute information are consistent under the object's material, meaning the attribute anomaly description result for the object's material is "No anomaly description." Regarding the object's weight, the predefined attribute information is described as "0.15 grams," while the real-time attribute information is described as "0.16 grams." This means the predefined and real-time attribute information are inconsistent under the object's weight, meaning the attribute anomaly description result for the object's weight is "An anomaly description exists."
[0144] For the object model, the predefined attribute information is described as "ABC", while the real-time attribute information is described as "unknown". This means that there is a second unknown attribute information under the object model. Therefore, the attribute anomaly description result for the object model is unknown attribute information. Similarly, for the object specification, both the predefined and real-time attribute information are described as "50 cm". Therefore, the attribute information for the predefined and real-time attribute information is consistent under the object specification, meaning the attribute anomaly description result for the object specification is no anomaly description.
[0145] It is understood that the corresponding examples in the embodiments of this application are used to understand this solution, but should not be construed as specific limitations on this solution.
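The per-attribute walk-through above can be reproduced with a short sketch. The three result labels mirror the text; the null-value convention (`None`, empty, or "unknown") and the case-insensitive exact comparison are assumptions of this sketch, and a production system would likely normalize units and wording before comparing.

```python
NO_ANOMALY = "no anomaly description"
ANOMALY = "anomaly description exists"
UNKNOWN = "unknown attribute information"


def is_null(value):
    """A value is null if it was never assigned or is marked as unknown."""
    return value is None or value.strip().lower() in ("", "unknown")


def attribute_anomaly_results(predefined, realtime):
    """Return one attribute anomaly description result per predefined type."""
    results = {}
    for attr_type, expected in predefined.items():
        described = realtime.get(attr_type)
        if is_null(expected) or is_null(described):
            results[attr_type] = UNKNOWN
        elif expected.strip().lower() == described.strip().lower():
            results[attr_type] = NO_ANOMALY
        else:
            results[attr_type] = ANOMALY
    return results


predefined = {
    "object name": "Golden Cherry Pendant", "object brand": "Y Fei",
    "object origin": "Country A", "object material": "gold",
    "object weight": "0.15 grams", "object model": "ABC",
    "object specification": "50 cm",
}
realtime = {
    "object name": "Golden Cherry Pendant", "object brand": "Y Fei",
    "object origin": "unknown", "object material": "gold",
    "object weight": "0.16 grams", "object model": "unknown",
    "object specification": "50 cm",
}

results = attribute_anomaly_results(predefined, realtime)
for attr_type, outcome in results.items():
    print(attr_type, "->", outcome)
```

Checking for null values before comparing ensures that an attribute nobody described is reported as unknown rather than as a mismatch.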
[0146] In this embodiment, predefined attribute information is extracted from predefined description content, and real-time attribute information is extracted from real-time description content. This avoids redundant information about the displayed object in the description content. By extracting attribute information, the true attributes of the displayed object can be more accurately described. The predefined attribute information and real-time attribute information are compared to determine whether there are any mismatches between the real-time and predefined attribute information, thus achieving more refined and accurate anomaly identification and improving the accuracy of information recognition. Furthermore, multiple predefined attribute types for object description are considered, allowing for more refined anomaly identification across different description dimensions, further improving the accuracy of information recognition.
[0147] In one embodiment, as shown in Figure 7, the predefined descriptive features are obtained by:
[0148] Step 702: Obtain the predefined image and predefined text of the object to be displayed.
[0149] The predefined image is a display image predefined for the current display object. In other words, the predefined image is used to comprehensively and multi-dimensionally display the display object. Therefore, a predefined image can consist of multiple images for the current display object. Multiple predefined sub-images belonging to the same predefined image are all images displaying the same display object. Secondly, the predefined text is text pre-registered for the display object. That is, the predefined text includes at least predefined attribute information for the display object. The aforementioned predefined attribute information is not fixed in specific predefined attribute types and can be predefined based on the actual scenario requirements.
[0150] Specifically, the server obtains the predefined image and predefined text of the object to be displayed. That is, the server constructs the predefined image of the object based on multiple predefined sub-images that provide a comprehensive and multi-faceted display of the object. Then, it determines the text pre-registered for the object and designates this predefined text as the predefined text of the object. It can be understood that the server can also designate the pre-registered text for the object as the initial description text, and then extract predefined attribute information from the initial description text according to predefined attribute types, thus obtaining predefined text that only includes predefined attribute information.
[0151] Step 704: Extract image features from the predefined image to obtain the predefined image features of the displayed object, and extract text features from the predefined text to obtain the predefined text features of the displayed object.
[0152] Specifically, the server extracts image features from the predefined image to obtain the predefined image features of the displayed object. That is, the server uses an image encoder to extract predefined image feature vectors from the predefined image, and then determines the predefined image features of the displayed object based on these predefined image feature vectors. The image encoder can be a pre-trained model, such as CLIP or BLIP, and is not limited here. Similarly, the server extracts text features from the predefined text to obtain the predefined text features of the displayed object. That is, the server performs tokenization and embedding processing on the predefined text to obtain predefined text vectors, which are the predefined text features of the displayed object.
[0153] Since predefined images are usually multiple images, feature fusion of multiple images should be considered when performing image feature extraction. This will be introduced below:
[0154] In one specific embodiment, image feature extraction is performed on a predefined image to obtain predefined image features of the display object, including: extracting image features from multiple predefined sub-images included in the predefined image to obtain predefined sub-image features of each predefined sub-image; and performing feature fusion processing on the features of each predefined sub-image to obtain predefined image features of the display object.
[0155] Specifically, the server extracts image features from each of the multiple predefined sub-images included in the predefined image, obtaining the predefined sub-image features of each predefined sub-image. Then, it performs feature fusion processing on these predefined sub-image features to obtain the predefined image features of the displayed object. In other words, the number of predefined sub-images for a displayed object is not fixed; that is, after extracting image features from different predefined sub-images, there will be several predefined sub-image features. Therefore, a cross-attention module is needed to fuse the features of multiple predefined sub-images with a variable number of features. Thus, the server extracts features from multiple predefined sub-images separately using an image encoder, obtaining the predefined sub-image feature vector corresponding to each predefined sub-image. At this point, the cross-attention module needs to fuse the image feature vectors of multiple images, that is, to perform feature fusion processing on the features of each predefined sub-image, thereby obtaining the predefined image features of the displayed object.
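To show how a variable number of sub-image features can be reduced to one fixed-size feature, here is a simplified stand-in for the cross-attention module: a single query vector attends over the sub-image feature vectors with scaled dot-product attention. This is not the module described above, which would normally use learned query, key and value projections; the query values, the function name and the toy vectors are all assumptions made for illustration.

```python
import math


def attention_fuse(query, sub_features):
    """Fuse N sub-image feature vectors into one vector of the same size.

    Scores are scaled dot products between the query and each feature,
    normalized with a softmax; the output is the weighted sum of features.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, feature)) / math.sqrt(dim)
        for feature in sub_features
    ]
    # Subtract the maximum score for a numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * feature[i] for w, feature in zip(weights, sub_features))
        for i in range(dim)
    ]
    return fused, weights


query = [0.5, 0.5, 0.5, 0.5]  # hypothetical stand-in for a learned query
two_views = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
three_views = two_views + [[0.0, 0.0, 2.0, 0.0]]

fused_two, weights_two = attention_fuse(query, two_views)
fused_three, weights_three = attention_fuse(query, three_views)
print(len(fused_two), len(fused_three))
```

The fused vector has the same length whether two or three sub-images are supplied, which is the property the text relies on when the number of predefined sub-images is not fixed.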
[0156] Step 706: Perform feature concatenation processing on the predefined image features and the predefined text features to obtain the predefined descriptive features of the displayed object.
[0157] Specifically, the server performs feature concatenation processing on the predefined image features and the predefined text features to obtain the predefined descriptive features of the displayed object. That is, for each displayed object, the predefined image features and the predefined text features of that displayed object are concatenated to obtain its predefined descriptive features.
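As a minimal sketch, concatenating an image feature vector and a text feature vector (as stated for the predefined description features in Step 807) can be written as below. The vectors are toy values and the function name is hypothetical; real features would come from the image encoder and the text embedding step.

```python
def concat_features(image_feature, text_feature):
    """Concatenate image and text feature vectors into one descriptive feature."""
    return list(image_feature) + list(text_feature)


predefined_image_feature = [0.12, 0.80, 0.33]  # toy values
predefined_text_feature = [0.05, 0.91]         # toy values
predefined_descriptive_feature = concat_features(
    predefined_image_feature, predefined_text_feature
)
print(predefined_descriptive_feature)
```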
[0158] It is understood that the corresponding examples in the embodiments of this application are used to understand this solution, but should not be construed as specific limitations on this solution.
[0159] In this embodiment, the description content is obtained along two dimensions: description text and displayed image. Obtaining the description content along multiple description dimensions helps ensure that the content is complete and reliable. Separate features are therefore extracted for the description text and the displayed image so that the extracted features more accurately describe the information carried in the text and the image. The predefined description features obtained by concatenation are more complete and reliable. Based on complete and reliable predefined description features, more reliable object positioning can be performed, thereby improving the reliability of subsequent information acquisition and recognition.
[0160] Based on the detailed description of the foregoing embodiments, the complete process of the information recognition method in the embodiments of this application is described below. In one embodiment, as shown in Figure 8, an information recognition method is provided. The method is described by taking its application to server 104 in Figure 1 as an example. It can be understood that the method can also be applied to terminal 102, or to a system including terminal 102 and server 104, in which case it is implemented through the interaction between terminal 102 and server 104. In this embodiment, the method includes the following steps:
[0161] Step 801: Extract audio from the video that displays multiple objects through language descriptions, extract the audio describing the current object in real time, and convert the audio into text to obtain the first real-time description text for the current object; perform text recognition on the subtitles displayed in the video for the current object to obtain the second real-time description text for the current object; and determine the first real-time description text and the second real-time description text as the real-time description text.
[0162] Step 802: Extract text features from the real-time description text to obtain the text features of the currently displayed object.
[0163] Step 803: Take screenshots of the videos that display multiple display objects through language descriptions, and determine the obtained video screenshots as real-time display images; or, perform image frame extraction processing on the videos that display multiple display objects through language descriptions, and determine the obtained image video frames as real-time display images.
[0164] Step 804: Extract image features from the real-time displayed image to obtain the image features of the currently displayed object.
[0165] Step 805: Perform feature concatenation processing on the text features and image features to obtain the real-time descriptive features of the currently displayed object.
[0166] Step 806: Obtain target prompt information representing the display number of the displayed object in the video.
[0167] Step 807: Based on real-time description features, predefined description features of each displayed object in the video, and target prompt information, determine the location of the target object matching the current displayed object; wherein, the predefined description features are obtained by: acquiring a predefined image and predefined text of the displayed object; extracting image features from the predefined image to obtain the predefined image features of the displayed object, and extracting text features from the predefined text to obtain the predefined text features of the displayed object; and performing feature concatenation processing on the predefined image features and the predefined text features to obtain the predefined description features of the displayed object.
[0168] Step 808: Determine the predefined descriptive features of the displayed object corresponding to the target object location as the predefined target descriptive features for the current displayed object.
[0169] Step 809: Determine the predefined descriptive content used to characterize the target descriptive features.
[0170] Step 810: Extract predefined attribute information from the predefined description content and extract real-time attribute information from the real-time description content.
[0171] Step 811: Based on predefined attribute information and real-time attribute information, perform anomaly description identification on the real-time attribute information to obtain the anomaly description identification result for the currently displayed object.
[0172] It should be understood that the specific implementation methods of steps 801 to 811 are similar to those of the aforementioned embodiments, and will not be repeated here.
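By way of non-limiting illustration, the data flow of steps 802 to 811 can be sketched as follows. The class name `RecognitionPipeline`, its callable fields, and the assumption that predefined description content is stored in a list parallel to the predefined description features are all hypothetical and are not taken from this application. The sketch also assumes, for simplicity, that real-time attribute information is extracted from the real-time description text only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
Attributes = Dict[str, Optional[str]]


@dataclass
class RecognitionPipeline:
    """Wires steps 802-811 together; every callable is a hypothetical placeholder."""
    embed_text: Callable[[str], Vector]                              # step 802
    embed_image: Callable[[bytes], Vector]                           # step 804
    locate: Callable[[Vector, Sequence[Vector], Optional[int]], int]  # step 807
    extract_attributes: Callable[[str], Attributes]                  # step 810
    compare: Callable[[Attributes, Attributes], str]                 # step 811

    def run(
        self,
        realtime_text: str,
        realtime_image: bytes,
        display_number: Optional[int],
        predefined_features: Sequence[Vector],
        predefined_contents: Sequence[str],
    ) -> str:
        text_features = self.embed_text(realtime_text)       # step 802
        image_features = self.embed_image(realtime_image)    # step 804
        realtime_features = text_features + image_features   # step 805: concatenation
        # Steps 806-807: the display number acts as the target prompt information.
        location = self.locate(realtime_features, predefined_features, display_number)
        # Steps 808-809: assumed parallel list of predefined description content.
        predefined_content = predefined_contents[location]
        predefined_attrs = self.extract_attributes(predefined_content)  # step 810
        realtime_attrs = self.extract_attributes(realtime_text)         # step 810
        return self.compare(predefined_attrs, realtime_attrs)           # step 811
```

Because each stage is injected as a callable, a feature extractor or attribute extractor can be replaced without changing the order in which the steps exchange data.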
[0173] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps.
[0174] Based on the same inventive concept, this application also provides an information identification device for implementing the information identification method described above. The solution provided by this device is similar to the solution described in the above method; therefore, the specific limitations in one or more information identification device embodiments provided below can be found in the limitations of the information identification method described above, and will not be repeated here.
[0175] In one embodiment, as shown in Figure 9, an information recognition device is provided, including: a content extraction module 902, a feature extraction module 904, a feature recognition module 906, and an information recognition module 908, wherein:
[0176] The content extraction module 902 is used to extract real-time descriptive content for the current display object from videos that describe and display multiple display objects respectively;
[0177] The feature extraction module 904 is used to extract features from the real-time description content to obtain the real-time description features of the currently displayed object;
[0178] The feature recognition module 906 is used to identify, based on real-time description features, a target description feature predefined for the current display object from predefined description features for each display object in the video, wherein the display objects in the video include the current display object;
[0179] The information recognition module 908 is used to perform abnormal description recognition on the real-time description content based on predefined description content that characterizes the target description features, to obtain the abnormal description recognition result for the currently displayed object.
[0180] In one embodiment, the real-time description content includes at least real-time description text and real-time displayed images;
[0181] The feature extraction module is specifically used to extract text features from real-time descriptive text to obtain the text features of the currently displayed object; to extract image features from real-time displayed images to obtain the image features of the currently displayed object; and to perform feature concatenation processing on the text features and image features to obtain the real-time descriptive features of the currently displayed object.
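As a minimal sketch of the feature concatenation processing described above, the text features and image features can be joined end to end into a single vector. The function name `concatenate_features` and the use of plain Python lists are assumptions made for illustration only.

```python
from typing import List


def concatenate_features(text_features: List[float], image_features: List[float]) -> List[float]:
    """Join the two vectors end to end.

    The result has len(text_features) + len(image_features) dimensions,
    with the text part first.
    """
    return list(text_features) + list(image_features)
```

For the concatenated real-time description features to be comparable with the predefined description features, both would need to be built in the same order (here, text first and image second) and from extractors with matching output dimensions.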
[0182] In one embodiment, the object description method for the currently displayed object in the video includes language description and subtitle description; the real-time description content includes at least real-time description text;
[0183] The feature extraction module is specifically used to perform audio extraction on the video that displays multiple display objects through language descriptions to obtain the audio describing the current display object in real time, and to convert the audio into text to obtain the first real-time description text for the current display object; to perform text recognition on the subtitles displayed in the video for the current display object to obtain the second real-time description text for the current display object; and to determine the first real-time description text and the second real-time description text as the real-time description text.
[0184] In one embodiment, the real-time description content includes at least a real-time display image of the currently displayed object;
[0185] The feature extraction module is specifically used to take screenshots of videos that display multiple objects through language descriptions, and determine the obtained video screenshots as real-time display images; or, to perform image frame extraction processing on videos that display multiple objects through language descriptions, and determine the obtained image video frames as real-time display images.
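As a non-limiting illustration of the image frame extraction alternative, the following sketch computes which frame indices would be sampled at a fixed time interval. The fixed-interval policy, the function name `sample_frame_indices`, and its parameters are assumptions; this application does not specify a sampling policy, and decoding the selected frames would be left to a video library that is not shown here.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Return the indices of the frames to keep as real-time display images."""
    if total_frames < 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and every_seconds must be > 0")
    # Never use a step below one frame, even for very short intervals.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))
```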
[0186] In one embodiment, the information recognition device further includes a prompt information acquisition module;
[0187] The prompt information acquisition module is used to acquire target prompt information that represents the display number of the displayed object in the video;
[0188] The feature recognition module is specifically used to determine the location of the target object that matches the current display object based on real-time description features, predefined description features of each displayed object in the video, and target prompt information; and to determine the predefined description features of the display object corresponding to the target object location as the predefined target description features for the current display object.
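One possible way to determine the target object location is sketched below. It is an assumption, not the method of this application, that matching uses cosine similarity and that the target prompt information (a 1-based display number) contributes a fixed additive bonus to the object shown at that position. The names `cosine_similarity`, `locate_target`, and `prompt_bonus` are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def locate_target(
    realtime_features: Sequence[float],
    predefined_features: Sequence[Sequence[float]],
    display_number: Optional[int] = None,  # assumed 1-based
    prompt_bonus: float = 0.1,             # assumed weighting of the prompt
) -> int:
    """Return the 0-based index of the best-matching display object."""
    if not predefined_features:
        raise ValueError("predefined_features must not be empty")
    best_index, best_score = 0, float("-inf")
    for index, feature in enumerate(predefined_features):
        score = cosine_similarity(realtime_features, feature)
        if display_number is not None and index == display_number - 1:
            score += prompt_bonus
        if score > best_score:  # strict comparison keeps the earliest index on ties
            best_index, best_score = index, score
    return best_index
```

In this sketch the prompt only breaks near-ties: when the feature similarity clearly favors another display object, the display number does not override it. A different weighting would shift that balance.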
[0189] In one embodiment, the information recognition module is specifically used to determine predefined description content used to characterize the features of the target description; extract predefined attribute information from the predefined description content and extract real-time attribute information from the real-time description content; and perform abnormal description recognition on the real-time attribute information based on the predefined attribute information and the real-time attribute information to obtain the abnormal description recognition result for the currently displayed object.
[0190] In one embodiment, the information recognition module is specifically used to determine that the abnormal description recognition result of the currently displayed object is no abnormal description if the predefined attribute information is consistent with the real-time attribute information; and to determine that the abnormal description recognition result of the currently displayed object is that an abnormal description exists if the predefined attribute information is inconsistent with the real-time attribute information.
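The consistency check described above can be sketched as follows. Treating two values as consistent when they are equal after case folding and whitespace normalization is an assumption made for illustration; the function name `identify_abnormal_description` and the result strings are likewise hypothetical.

```python
from typing import Dict, Tuple


def _normalize(value: str) -> str:
    """Case-fold and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.casefold().split())


def identify_abnormal_description(
    predefined: Dict[str, str], realtime: Dict[str, str]
) -> Tuple[str, Dict[str, Tuple[str, str]]]:
    """Compare attribute values type by type and report any mismatches."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for attribute_type, expected in predefined.items():
        actual = realtime.get(attribute_type, "")
        if _normalize(expected) != _normalize(actual):
            mismatches[attribute_type] = (expected, actual)
    result = "abnormal description exists" if mismatches else "no abnormal description"
    return result, mismatches
```

Returning the mismatched pairs alongside the overall result makes it possible to report which attribute was described differently from its predefined value.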
[0191] In one embodiment, the information identification device further includes an attribute type acquisition module;
[0192] The attribute type acquisition module is used to acquire predefined attribute types that describe objects from multiple descriptive dimensions;
[0193] The information recognition module is specifically used to extract attributes from predefined description content according to multiple predefined attribute types to obtain predefined attribute information; the predefined attribute information includes attribute information of multiple predefined attribute types; and to extract attributes from real-time description content according to multiple predefined attribute types to obtain real-time attribute information; the real-time attribute information includes attribute information of multiple predefined attribute types.
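As a simplified illustration of extracting attributes according to multiple predefined attribute types, the sketch below applies one pattern per attribute type to a piece of description text. The attribute types ("price", "material", "weight"), the regular expressions, and the function name `extract_attributes` are hypothetical; this application does not specify the extractor, and a practical system might use a learned model instead of patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical attribute types, one pattern per descriptive dimension.
ATTRIBUTE_PATTERNS = {
    "price": re.compile(r"(\d+(?:\.\d+)?)\s*yuan"),
    "material": re.compile(r"made of (\w+)"),
    "weight": re.compile(r"(\d+(?:\.\d+)?)\s*(?:g|grams)\b"),
}


def extract_attributes(description: str) -> Dict[str, Optional[str]]:
    """Return one entry per predefined attribute type; None marks an empty value."""
    text = description.lower()
    attributes: Dict[str, Optional[str]] = {}
    for attribute_type, pattern in ATTRIBUTE_PATTERNS.items():
        match = pattern.search(text)
        attributes[attribute_type] = match.group(1) if match else None
    return attributes
```

Applying the same set of attribute types to both the predefined description content and the real-time description content yields two dictionaries with identical keys, which can then be compared type by type.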
[0194] In one embodiment, the anomaly description identification result also includes unknown attribute information;
[0195] The information recognition module is specifically used to identify the attribute information of the predefined attribute type that has an empty value in the predefined attribute information as the first unknown attribute information; to identify the attribute information of the predefined attribute type that has an empty value in the real-time attribute information as the second unknown attribute information; and to identify the attribute information of the matching predefined attribute type as unknown attribute information when the first unknown attribute information or the second unknown attribute information exists.
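The handling of unknown attribute information can be sketched as follows. Representing an empty value as `None` or a blank string, and the names `find_unknown_attribute_types` and `_is_empty`, are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def _is_empty(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def find_unknown_attribute_types(
    attribute_types: Sequence[str],
    predefined: Dict[str, Optional[str]],
    realtime: Dict[str, Optional[str]],
) -> List[str]:
    """List the attribute types whose value is empty on either side."""
    # First unknown attribute information: empty in the predefined attributes.
    first_unknown = {t for t in attribute_types if _is_empty(predefined.get(t))}
    # Second unknown attribute information: empty in the real-time attributes.
    second_unknown = {t for t in attribute_types if _is_empty(realtime.get(t))}
    return [t for t in attribute_types if t in first_unknown or t in second_unknown]
```

One possible use of this result is to exclude the unknown attribute types from the consistency comparison, so that a value that was simply not mentioned is reported as unknown instead of as inconsistent.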
[0196] In one embodiment, the information identification device further includes a descriptive feature acquisition module;
[0197] The descriptive feature acquisition module is used to acquire a predefined image and predefined text of the display object; to extract image features from the predefined image to obtain the predefined image features of the display object, and to extract text features from the predefined text to obtain the predefined text features of the display object; and to perform feature concatenation processing on the predefined image features and the predefined text features to obtain the predefined descriptive features of the display object.
[0198] In one embodiment, the descriptive feature acquisition module is specifically used to extract image features from multiple predefined sub-images included in the predefined image to obtain the predefined sub-image features of each predefined sub-image; and to perform feature fusion processing on the predefined sub-image features to obtain the predefined image features of the displayed object.
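As one illustrative possibility for the feature fusion processing, the predefined sub-image features can be averaged element by element. Mean pooling is an assumption here, since this application does not fix a particular fusion operation, and the function name `fuse_sub_image_features` is hypothetical.

```python
from typing import List, Sequence


def fuse_sub_image_features(sub_image_features: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-sub-image vectors into one vector by element-wise averaging."""
    if not sub_image_features:
        raise ValueError("at least one sub-image feature vector is required")
    dimension = len(sub_image_features[0])
    if any(len(feature) != dimension for feature in sub_image_features):
        raise ValueError("all sub-image feature vectors must have the same length")
    count = len(sub_image_features)
    return [sum(column) / count for column in zip(*sub_image_features)]
```

Averaging keeps the fused vector at the same dimension as each sub-image feature, regardless of how many predefined sub-images a display object has.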
[0199] In one embodiment, a computer device is provided, which can be a server or a terminal. This embodiment uses a server as an example, and its internal structure may be as shown in Figure 10. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage medium. The database stores data related to the embodiments of this application, such as video, predefined descriptive features, and predefined descriptive content. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When executed by the processor, the computer program implements an information recognition method.
[0200] Those skilled in the art will understand that the structure shown in Figure 10 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0201] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0202] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0203] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0204] It should be noted that the object information (including but not limited to object device information, object personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the object or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0205] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based information recognition logic devices, etc., and are not limited to these.
[0206] The technical features in the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0207] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for information identification, characterized in that, The method includes: Real-time description content for the current display object is extracted from videos that describe and display multiple display objects respectively; Feature extraction is performed on the real-time description content to obtain the real-time description features of the currently displayed object; Based on the real-time description features, target description features predefined for the current display object are identified from the predefined description features for each display object of the video, wherein the display objects of the video include the current display object; Based on predefined description content that characterizes the target description features, abnormal description identification is performed on the real-time description content to obtain abnormal description identification results for the currently displayed object.
2. The method according to claim 1, characterized in that, The real-time description content includes at least real-time description text and real-time displayed images; The step of extracting features from the real-time description content to obtain the real-time description features of the currently displayed object includes: Text feature extraction is performed on the real-time description text to obtain the text features of the currently displayed object; Image features are extracted from the real-time displayed image to obtain the image features of the currently displayed object; The text features and the image features are concatenated to obtain the real-time descriptive features of the currently displayed object.
3. The method according to claim 2, characterized in that, The object description manner for the currently displayed object in the video includes language description and subtitle description; the real-time description content includes at least real-time description text; The step of extracting real-time descriptive content for the current display object from videos that respectively showcase multiple display objects through language descriptions includes: Audio extraction is performed on the video that displays multiple display objects through language descriptions to obtain the audio describing the current display object in real time, and the audio is converted into text to obtain a first real-time description text for the current display object; Text recognition is performed on the subtitles displayed in the video for the currently displayed object to obtain a second real-time description text for the currently displayed object; The first real-time description text and the second real-time description text are determined as the real-time description text.
4. The method according to claim 2, characterized in that, The real-time description content includes at least the real-time display image of the currently displayed object; The step of extracting real-time descriptive content for the current display object from videos that respectively showcase multiple display objects through language descriptions includes: Take screenshots of videos that describe multiple objects using language, and use these screenshots as the real-time display images. Alternatively, perform image frame extraction processing on the video that describes multiple display objects through language, and determine the resulting image video frame as the real-time display image.
5. The method according to claim 1, characterized in that, The method further includes: Obtain target prompt information representing the display number of the displayed object in the video; The step of identifying a predefined target description feature for the current display object from predefined description features for each display object in the video based on the real-time description features includes: Based on the real-time description features, the predefined description features of each displayed object in the video, and the target prompt information, the location of the target object matching the current displayed object is determined; The predefined descriptive features of the displayed object corresponding to the target object location are determined as the predefined target descriptive features for the currently displayed object.
6. The method according to claim 1, characterized in that, The method of identifying abnormal descriptions of the real-time description content based on predefined description content characterizing the target description features, and obtaining abnormal description identification results for the currently displayed object, includes: Determine the predefined descriptive content used to characterize the descriptive features of the target; Extract predefined attribute information from the predefined description content, and extract real-time attribute information from the real-time description content; Based on the predefined attribute information and the real-time attribute information, anomaly description identification is performed on the real-time attribute information to obtain anomaly description identification results for the currently displayed object.
7. The method according to claim 6, characterized in that, The step of performing anomaly description identification on the real-time attribute information based on the predefined attribute information and the real-time attribute information to obtain anomaly description identification results for the currently displayed object includes: If the predefined attribute information is consistent with the real-time attribute information, the abnormal description identification result of the currently displayed object is determined to be no abnormal description. If the predefined attribute information is inconsistent with the real-time attribute information, the abnormal description identification result of the currently displayed object is determined to be that an abnormal description exists.
8. The method according to claim 6, characterized in that, The method further includes: Obtain predefined attribute types that describe an object from multiple descriptive dimensions; The step of extracting predefined attribute information from the predefined description content and extracting real-time attribute information from the real-time description content includes: From the predefined description content, attributes are extracted according to multiple predefined attribute types to obtain predefined attribute information; the predefined attribute information includes attribute information of multiple predefined attribute types; Attributes are extracted from the real-time description content according to multiple predefined attribute types to obtain real-time attribute information; the real-time attribute information includes attribute information of multiple predefined attribute types.
9. The method according to claim 8, characterized in that, The anomaly description identification results also include unknown attribute information; Based on the predefined attribute information and the real-time attribute information, anomaly description identification is performed on the real-time attribute information to obtain anomaly description identification results for the currently displayed object, including: The attribute information of the predefined attribute type that has an empty value in the predefined attribute information is determined as the first unknown attribute information; Attribute information of predefined attribute types with null values in real-time attribute information is identified as the second unknown attribute information; If either the first unknown attribute information or the second unknown attribute information exists, the attribute information of the matched predefined attribute type is determined as the unknown attribute information.
10. The method according to claim 1, characterized in that, The manner of obtaining the predefined descriptive features includes: Obtain the predefined image and predefined text of the displayed object; Image feature extraction is performed on the predefined image to obtain the predefined image features of the display object, and text feature extraction is performed on the predefined text to obtain the predefined text features of the display object; Feature concatenation processing is performed on the predefined image features and the predefined text features to obtain the predefined descriptive features of the displayed object.
11. The method according to claim 10, characterized in that, The step of extracting image features from the predefined image to obtain the predefined image features of the displayed object includes: Image feature extraction is performed on each of the multiple predefined sub-images included in the predefined image to obtain the predefined sub-image features of each predefined sub-image; The predefined sub-image features are fused to obtain the predefined image features of the displayed object.
12. An information identification device, characterized in that, The device includes: The content extraction module is used to extract real-time descriptive content for the current display object from videos that describe and display multiple display objects respectively; The feature extraction module is used to extract features from the real-time description content to obtain the real-time description features of the currently displayed object; The feature recognition module is used to identify, based on the real-time description features, a target description feature predefined for the current display object from the predefined description features for each display object of the video, wherein the display objects of the video include the current display object; The information recognition module is used to identify abnormal descriptions of the real-time description content based on predefined description content that characterizes the target description features, and to obtain the abnormal description recognition result for the currently displayed object.
13. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 11.
14. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 11.
15. A computer program product, comprising a computer program, characterized in that, When executed by a processor, the computer program implements the steps of the method described in any one of claims 1 to 11.