Method and device for extracting pictures from webpage text

A picture extraction technology applied in the computer field. It addresses the low correlation between extracted pictures and webpage text, the resulting unclear theme of the webpage information, and the degraded user experience, and achieves clearer themes, easier picture search, and an improved user experience.

Active Publication Date: 2016-04-06
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0005] Extracting the first picture in the webpage text, or randomly extracting a picture from all picture nodes, as the picture representing the webpage may yield a picture that has nothing to do with the content of the webpage text. The resulting low correlation between the extracted picture and the webpage text makes the theme of the webpage information represented by the picture unclear, which degrades the user experience.

Examples

Embodiment 1

[0063] Referring to Figure 1, this embodiment provides a method for extracting pictures from webpage text. The process of the method is as follows:

[0064] 101: Obtain the picture nodes in the webpage text, and obtain the text description information of each picture node.

[0065] 102: According to the text description information of the picture nodes, extract the picture most relevant to the webpage text from the acquired picture nodes.

[0066] Extracting the picture most relevant to the webpage text according to the text description information of the picture nodes includes:

[0067] calculating the similarity between the text description information and the webpage title of the webpage text;

[0068] extracting, from the picture nodes whose similarity is greater than or equal to a preset threshold, the picture with the greatest similarity.
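
The selection rule in [0067] and [0068] can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not name a similarity measure, so a Jaccard overlap of word sets is assumed here, and the function names, the `description` and `src` keys, and the default threshold of 0.2 are all hypothetical.

```python
import re
from typing import Optional


def similarity(description: str, title: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumption: the source does not specify the measure; this is a stand-in.
    """
    desc_words = set(re.findall(r"\w+", description.lower()))
    title_words = set(re.findall(r"\w+", title.lower()))
    if not desc_words or not title_words:
        return 0.0
    return len(desc_words & title_words) / len(desc_words | title_words)


def extract_most_relevant(
    nodes: list[dict], title: str, threshold: float = 0.2
) -> Optional[dict]:
    """Return the picture node whose description best matches the title.

    Only nodes with similarity >= threshold are candidates; None is
    returned when no node reaches the threshold.
    """
    scored = [(similarity(node.get("description", ""), title), node) for node in nodes]
    candidates = [(score, node) for score, node in scored if score >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

Returning `None` when nothing reaches the threshold keeps the "no sufficiently relevant picture" case explicit, so a caller can decide what to do next.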

[0069] Specifically, obtaining the text description information of the picture node includes:

[0070] Obt...

Embodiment 2

[0087] Referring to Figure 2, this embodiment provides a method for extracting pictures from webpage text. The process of the method is as follows:

[0088] 201: Obtain the picture nodes in the webpage text.

[0089] In this embodiment, a picture node is a part of the webpage text, which usually includes pictures, text, and so on. To facilitate picture extraction, the webpage text can be divided in advance into picture nodes and text nodes. Specifically, the webpage may be divided through its DOM (Document Object Model) tree; other methods may also be used, and the present invention does not limit this. Correspondingly, the node features of the DOM tree can be used to obtain the picture nodes in the webpage text, which is not described in detail here.

[0090] 202: Obtain attribute information of the aforementioned picture nodes.

[0091] The attribute information of the picture node include...
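
Steps 201 and 202 can be illustrated with a short sketch that collects picture nodes and their attributes from HTML. It rests on assumptions: it uses the event-driven `html.parser` module from the Python standard library rather than building a full DOM tree, it treats every `<img>` element as a picture node, and it keeps all attributes because the source text does not say here which ones matter. The names `PictureNodeCollector` and `get_picture_nodes` are hypothetical.

```python
from html.parser import HTMLParser


class PictureNodeCollector(HTMLParser):
    """Collects the attributes of every <img> element it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.picture_nodes: list[dict] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep them all as a dict.
        if tag == "img":
            self.picture_nodes.append(dict(attrs))


def get_picture_nodes(html: str) -> list[dict]:
    """Return one attribute dict per <img> element, in document order."""
    collector = PictureNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.picture_nodes
```

Each returned dict holds whatever attributes the element carries, so a later step can pick out the ones it needs.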

Embodiment 3

[0119] Referring to Figure 3, this embodiment provides a method for extracting pictures from webpage text. The difference from Embodiment 2 is that in this embodiment the pictures in the webpage text are extracted according to the length of the URL of the picture node. The process of the method is as follows:

[0120] 301: Obtain the picture nodes in the webpage text.

[0121] Specifically, the node features of the DOM tree can be used to obtain the picture nodes in the webpage text, which is not described in detail here.

[0122] 302: If the text description information of the picture node is not obtained, or it is obtained but the similarity between it and the webpage title of the webpage text is less than the preset threshold, obtain the length of the URL of the picture node.

[0123] Specifically, the length of the URL of the picture node ...
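
The condition in step 302 can be sketched as below. This is a hedged illustration only: it applies the condition node by node (one possible reading), reuses an assumed Jaccard word-overlap similarity that the source does not specify, and stops at measuring the URL lengths, because the text describing how the length is then used is cut off above. The function names, the `src` and `description` keys, and the default threshold are hypothetical.

```python
import re


def similarity(description: str, title: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed measure)."""
    desc_words = set(re.findall(r"\w+", description.lower()))
    title_words = set(re.findall(r"\w+", title.lower()))
    if not desc_words or not title_words:
        return 0.0
    return len(desc_words & title_words) / len(desc_words | title_words)


def url_lengths_for_fallback(
    nodes: list[dict], title: str, threshold: float = 0.2
) -> dict[str, int]:
    """Map each fallback node's URL to its length in characters.

    A node falls back to the URL-length path when it has no text
    description, or when its description's similarity to the webpage
    title is below the threshold.
    """
    lengths: dict[str, int] = {}
    for node in nodes:
        description = node.get("description")
        if not description or similarity(description, title) < threshold:
            lengths[node["src"]] = len(node["src"])
    return lengths
```

Nodes whose description already meets the threshold are left out of the result, since for them the text-based path applies.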

Abstract

The invention discloses a method and a device for extracting pictures from webpage text, belonging to the technical field of computers. The method includes: acquiring picture nodes in the webpage text and acquiring text description information of the picture nodes; and extracting, from the picture nodes, the picture most relevant to the webpage text according to that text description information. The device comprises a first acquiring module and a first extracting module. The method and the device increase the correlation between the extracted pictures and the webpage text, so that the webpage information theme represented by the extracted picture is clearer and user experience is greatly improved.

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting pictures from webpage text.

Background

[0002] As the amount of network information increases, users who enter keywords to search for information obtain a large amount of webpage information related to those keywords. This webpage information is displayed as text, so users must browse the text of each webpage to collect information, which makes obtaining information more difficult.

[0003] In the prior art, the browser provides multimedia information related to the webpage text, such as pictures and videos, and displays visual information related to the webpage text to the user. Specifically, if the webpage text contains pictures, then get all the picture nodes in the webpage text, extract the first picture in the webpage text or randomly extract a pic...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 蔡兵, 张凯, 徐羽
Owner: TENCENT TECH (SHENZHEN) CO LTD