Method and device for extracting pictures from webpage text

An image extraction method, applied in the computer field, that addresses the low correlation between extracted pictures and webpage text, the resulting unclear theme of the webpage information, and the harm to user experience. Its effects are clearer themes, easier picture search, and an improved user experience.

Active Publication Date: 2016-04-06

TENCENT TECH (SHENZHEN) CO LTD

Cites: 5. Cited by: 0.

AI Technical Summary

Problems solved by technology

[0005] Extracting the first picture in the webpage text, or randomly extracting a picture from all picture nodes, as the picture representing the webpage may yield a picture that has nothing to do with the content of the webpage text. The result is a low correlation between the extracted picture and the webpage text, which makes the theme of the webpage information represented by the picture unclear and degrades the user experience.

Method used

Examples

Embodiment 1

[0063] Referring to Figure 1, this embodiment provides a method for extracting pictures from webpage text. The method proceeds as follows:

[0064] 101: Obtain the image nodes in the webpage text, and obtain the text description information of each image node.

[0065] 102: According to the text description information of the image nodes, extract the image most relevant to the webpage text from the obtained image nodes.

[0066] Extracting the image most relevant to the webpage text from the obtained image nodes, according to their text description information, includes:

[0067] Calculating the similarity between the text description information and the webpage title of the webpage text;

[0068] From the image nodes whose similarity is greater than or equal to a preset threshold, extracting the image with the greatest similarity.
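
Steps [0067] and [0068] can be sketched as follows. This is only an illustration: the visible text does not say how similarity is computed, so the Jaccard overlap of word tokens used here is an assumed stand-in, and the names `tokenize`, `similarity`, `pick_image`, the `description` key, and the default threshold are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word tokens; a stand-in for the real text segmentation."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(description: str, title: str) -> float:
    """Jaccard overlap of token sets (assumed measure, not from the patent)."""
    a, b = tokenize(description), tokenize(title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_image(images: List[Dict[str, str]], title: str,
               threshold: float = 0.2) -> Optional[Dict[str, str]]:
    """Among images whose similarity to the title is >= threshold,
    return the one with the greatest similarity, or None if there is none."""
    scored = [(similarity(img.get("description", ""), title), img) for img in images]
    candidates = [(score, img) for score, img in scored if score >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

Returning `None` when no node reaches the threshold leaves room for a separate fallback, such as the URL-length approach of Embodiment 3.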

[0069] Specifically, obtaining the text description information of the image node includes:

[0070] Obt...

Embodiment 2

[0087] Referring to Figure 2, this embodiment provides a method for extracting pictures from webpage text. The method proceeds as follows:

[0088] 201: Obtain image nodes in the webpage text.

[0089] In this embodiment, a picture node is a part of the webpage text, which usually includes both pictures and text. To facilitate picture extraction, the webpage text can be divided in advance into picture nodes and text nodes. Specifically, the webpage may be divided using its DOM (Document Object Model) tree; other methods may also be used, and the present invention does not limit the choice. Correspondingly, the node features of the DOM tree can be used to obtain the image nodes in the webpage text, which is not described in detail here.
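
One way to obtain the image nodes is sketched below. It is an assumption-laden illustration rather than the patented implementation: instead of building a full DOM tree it uses the event-based `HTMLParser` from the Python standard library, treats every `<img>` element as a picture node, and the names `ImageNodeCollector` and `get_image_nodes` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImageNodeCollector(HTMLParser):
    """Collects the attributes of every <img> element encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.image_nodes: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "img":
            # attrs is a list of (name, value) pairs; value may be None.
            self.image_nodes.append({name: (value or "") for name, value in attrs})


def get_image_nodes(html: str) -> List[Dict[str, str]]:
    """Return one attribute dictionary per <img> element, in document order."""
    collector = ImageNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.image_nodes
```

Each returned dictionary holds the raw attributes of one picture node (for example `src` and `alt`), which later steps can inspect.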

[0090] 202: Obtain attribute information of the aforementioned image node.

[0091] The attribute information of the picture node include...

Embodiment 3

[0119] Referring to Figure 3, this embodiment provides a method for extracting pictures from webpage text. It differs from Embodiment 2 in that the pictures in the webpage text are extracted according to the length of the URL of the picture node. The method proceeds as follows:

[0120] 301: Obtain image nodes in the webpage text.

[0121] Specifically, the node features of the DOM tree can be used to obtain the image nodes in the webpage text, which will not be described in detail here.

[0122] 302: If the text description information of the picture node is not obtained, or it is obtained but its similarity to the webpage title of the webpage text is less than the preset threshold, then obtain the length of the URL of the picture node.
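
The condition in step 302 can be sketched as below. The function only gathers URL lengths for the nodes that need the fallback; how those lengths are then used to choose a picture is not visible in this excerpt, so no selection rule is implemented. The function name, the `description` and `src` keys, and the injected `similarity` callable are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def url_lengths_for_fallback(
    image_nodes: List[Dict[str, str]],
    title: str,
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, int]]:
    """Return (url, url_length) for each node that has no text description,
    or whose description scores below the threshold against the title."""
    results: List[Tuple[str, int]] = []
    for node in image_nodes:
        description = node.get("description", "")
        needs_fallback = (not description) or similarity(description, title) < threshold
        if needs_fallback:
            url = node.get("src", "")
            results.append((url, len(url)))
    return results
```

Passing the similarity function as a parameter keeps this step independent of whichever similarity measure is actually used.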

[0123] Specifically, the length of the URL of the picture node ...

Abstract

The invention discloses a method and a device for extracting pictures from webpage text, and belongs to the technical field of computers. The method includes acquiring the picture nodes in the webpage text and the text description information of those picture nodes, then extracting, according to that text description information, the picture most relevant to the webpage text from the picture nodes. The device comprises a first acquiring module and a first extracting module. The method and the device increase the correlation between the extracted picture and the webpage text, so that the theme of the webpage information represented by the extracted picture is clearer and the user experience is improved.

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting pictures from webpage text.

Background

[0002] As the amount of network information grows, users who enter keywords to search for information obtain a large amount of keyword-related webpage information. This information is displayed as text, so users must read through the text of each webpage to collect information, which makes information harder to obtain.

[0003] In the prior art, the browser provides multimedia information related to the webpage text, such as pictures and videos, and displays visual information related to the webpage text to the user. Specifically, if the webpage text contains pictures, then obtain all the picture nodes in the webpage text, and extract the first picture in the webpage text or randomly extract a pic...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

Inventor: 蔡兵, 张凯, 徐羽

Owner: TENCENT TECH (SHENZHEN) CO LTD
