A web page information extraction method and device

A web page information extraction technology, applied in the Internet field, that addresses problems such as low efficiency and low accuracy

Active Publication Date: 2021-06-15
CHINA MOBILE SUZHOU SOFTWARE TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0003] Embodiments of the present invention provide a method and device for extracting web page information. They address the low efficiency and low accuracy of approaches that rely on manually marking the DOM tree to locate the position from which web page information is extracted.



Examples


Embodiment Construction

[0055] In order to make the objects, technical solutions and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely intended to illustrate the invention and are not intended to limit it.

[0056] An embodiment of the present invention provides a web page information extraction method which, as shown in Figure 1, includes the following steps:

[0057] Step S101: obtain the DOM tree of the web page and a screenshot of the web page's displayed page.

[0058] Step S102: determine the candidate elements of the web page and the text information of the candidate elements according to the DOM tree of the web page.

[0059] Step S103: determine the candidate location information of the web page according to the web page screenshot.

[0060] Step S104, according to each candidate location informa...
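Step S102 can be illustrated with a minimal sketch. It assumes that a candidate element is any element that directly contains visible text, and it represents each one by its tag path from the root; neither choice is stated in the source. The names `CandidateCollector` and `extract_candidates` are hypothetical, and Python's standard `html.parser` stands in for whatever DOM representation the patent actually uses.

```python
from html.parser import HTMLParser

# Tags whose text is never shown to the reader (an assumption of this sketch).
SKIPPED_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CandidateCollector(HTMLParser):
    """Collects (tag path, text) pairs for elements that directly contain text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open tags from the root to the current element
        self.candidates = []   # list of (path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and not (SKIPPED_TAGS & set(self.stack)):
            self.candidates.append(("/".join(self.stack), text))


def extract_candidates(html):
    """Return the candidate elements of a page together with their text."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates
```

A tag path is only one possible identifier for a candidate element; a real pipeline would more likely keep a reference to the DOM node itself so that it can be matched against screenshot coordinates later.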



Abstract

An embodiment of the present invention provides a method and device for extracting web page information, including: determining the candidate elements of the web page and their text information according to the DOM tree of the web page; determining the candidate position information of the web page according to a screenshot of the web page's displayed page; determining, based on the text information of each candidate element, a first probability that each candidate element is a target extraction element and a second probability that each candidate position is a target extraction position; determining the target extraction elements and target extraction positions from the candidate elements and candidate positions according to the first probability and the second probability; and extracting information from the web page according to the candidate elements determined to be target extraction elements and the candidate positions determined to be target extraction positions. By extracting the text information of the candidate elements and the candidate position information of the web page, and by using a neural network model and a spatial probability distribution model to locate the extraction position, the method improves the positioning accuracy and fault tolerance of web page information extraction and automates the extraction of web page information.
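The selection step in the abstract can be sketched as follows. The source does not state the decision rule, so this sketch assumes the simplest one: take the candidate element with the highest first probability and the candidate position with the highest second probability, independently of each other. The function name `select_targets`, the element identifiers, and the `(x, y, width, height)` position format are all hypothetical.

```python
def select_targets(first_prob, second_prob):
    """Pick the most probable candidate element and candidate position.

    first_prob:  maps a candidate element identifier to its first probability.
    second_prob: maps a candidate position (x, y, width, height) to its
                 second probability.
    Assumes an independent arg-max over each table; the patent may combine
    the two probabilities differently.
    """
    if not first_prob or not second_prob:
        raise ValueError("both probability tables must be non-empty")
    target_element = max(first_prob, key=first_prob.get)
    target_position = max(second_prob, key=second_prob.get)
    return target_element, target_position
```

If several elements or positions should be extracted from one page, a threshold on each probability would be a natural alternative to the arg-max used here.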

Description

Technical field

[0001] Embodiments of the present invention relate to the field of Internet technology, and more particularly to a web page information extraction method and apparatus.

Background art

[0002] With the rapid growth of information on the Internet, web pages have become the most important way for people to acquire knowledge and information. Traditional search engine technology retrieves web pages according to the user's query, which improves the efficiency of information retrieval. However, the large number of results returned by a search engine still requires manual review and screening. With the explosive growth of information, this way of searching can no longer meet people's need to master information resources, and the emergence of knowledge graph technology offers a new approach to the information retrieval problem. Knowledge graph technology returns results to users in graphical form, which is the foundation and bridge of intelligent language r...


Application Information

Patent Type & Authority: Patents (China)
IPC (8): G06F16/9535, G06F16/951, G06F40/143, G06F40/146, G06N5/02
CPC: G06N5/022, G06F40/14, G06F16/9535
Inventor: 梁俊蒋忠强全兵胡小克巴伟
Owner: CHINA MOBILE SUZHOU SOFTWARE TECH CO LTD