Webpage information extraction method and device

A webpage information extraction technology, applied in the Internet field, that addresses problems such as low extraction efficiency and accuracy.

Active Publication Date: 2019-08-09
CHINA MOBILE SUZHOU SOFTWARE TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0003] Embodiments of the present invention provide a method and device for extracting webpage information. They address the low efficiency and accuracy of locating extraction positions by manually annotating the DOM tree.



Examples


Embodiment Construction

[0055] In order to make the object, technical solution and beneficial effects of the present invention more clear, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0056] An embodiment of the present invention provides a webpage information extraction method which, as shown in Figure 1, includes the following steps:

[0057] Step S101: acquire the DOM tree of the webpage and a screenshot of the displayed page of the webpage.

[0058] Step S102: determine the candidate elements of the webpage and the text information of each candidate element according to the DOM tree of the webpage.

[0059] Step S103: determine the candidate position information of the webpage according to the screenshot of the displayed page of the webpage.

[0060] Step S...
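
Step S102 can be illustrated with a minimal sketch. It assumes that a candidate element is any element that directly contains visible text, and it identifies each candidate by its tag path from the root. Both choices are assumptions made for illustration, as are the names `CandidateCollector` and `extract_candidates`; the source does not state how candidates are chosen. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
SKIP_TAGS = {"script", "style"}


class CandidateCollector(HTMLParser):
    """Collects (tag_path, text) pairs for elements that directly contain text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags from the root to the current node
        self.candidates = []  # list of (tag_path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.stack and not SKIP_TAGS.intersection(self.stack):
            self.candidates.append(("/".join(self.stack), text))


def extract_candidates(html):
    """Return the candidate elements of an HTML string with their text."""
    parser = CandidateCollector()
    parser.feed(html)
    parser.close()
    return parser.candidates


if __name__ == "__main__":
    page = ("<html><body><h1>Title</h1>"
            "<div><p>Price: <b>42</b></p><script>var x=1;</script></div>"
            "</body></html>")
    for path, text in extract_candidates(page):
        print(path, "->", text)
```

A tag path is a weak identifier, because sibling elements with the same tag share one path. A fuller version would probably record sibling indices or element attributes as well.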


Abstract

Embodiments of the invention provide a webpage information extraction method and device. The method comprises: determining candidate elements of a webpage and the text information of each candidate element according to the DOM tree of the webpage; determining candidate position information of the webpage according to a screenshot of the displayed page; determining, according to the candidate position information and the text information of each candidate element, a first probability that each candidate element is a target extraction element and a second probability that each candidate position is a target extraction position; determining the target extraction elements and target extraction positions from the candidate elements and candidate positions according to the first and second probabilities; and extracting information from the webpage according to the candidate elements determined to be target extraction elements and the candidate positions determined to be target extraction positions. Because the text information of the candidate elements and the candidate position information are extracted, and a neural network model and a spatial probability distribution model are used to locate the extraction position, the positioning precision and fault tolerance of webpage information extraction are improved, and webpage information is extracted automatically.
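
The selection step, in which targets are chosen from the first and second probabilities, can be sketched as follows. The source does not say how the two probabilities are turned into a decision, so the fixed threshold used here is purely an assumption. The function name `select_targets`, the use of tag paths as element keys and the `(x, y, width, height)` boxes as position keys are also hypothetical. In the described method the probabilities would come from the neural network model and the spatial probability distribution model; here they are supplied by hand.

```python
def select_targets(first_prob, second_prob, threshold=0.5):
    """Pick target elements and target positions by thresholding.

    first_prob:  maps each candidate element to its probability of being
                 a target extraction element.
    second_prob: maps each candidate position to its probability of being
                 a target extraction position.
    threshold:   assumed cut-off; not taken from the source.
    """
    def keep(scores):
        # Rank candidates by probability, highest first, then apply the cut-off.
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return [key for key, prob in ranked if prob >= threshold]

    return keep(first_prob), keep(second_prob)


if __name__ == "__main__":
    element_probs = {
        "html/body/h1": 0.9,
        "html/body/div/p": 0.2,
        "html/body/div/p/b": 0.7,
    }
    position_probs = {
        (0, 0, 800, 120): 0.8,    # hypothetical (x, y, width, height) box
        (0, 600, 800, 80): 0.1,
    }
    elements, positions = select_targets(element_probs, position_probs)
    print(elements)
    print(positions)
```

Thresholding each set independently is the simplest possible rule. A joint rule that scores each element together with its position would be an equally plausible reading of the abstract.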

Description

Technical field

[0001] Embodiments of the present invention relate to the field of Internet technologies, and in particular to a method and device for extracting webpage information.

Background

[0002] With the rapid growth of information on the Internet, web pages have become the most important way for people to acquire knowledge and information. Traditional search engine technology can quickly rank web pages according to user queries, which improves the efficiency of information retrieval. However, the large number of results returned by search engines must still be examined and screened manually. With the explosive growth of information, this retrieval method struggles to meet people's need for full command of information resources. The emergence of knowledge graph technology provides new ideas for solving information retrieval problems. The knowledge graph technology returns the processed and recommended knowledge to user...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/9535, G06F16/951, G06F17/22, G06N5/02
CPC: G06N5/022, G06F40/14, G06F16/9535
Inventor 梁俊蒋忠强全兵胡小克巴伟
Owner CHINA MOBILE SUZHOU SOFTWARE TECH CO LTD