Webpage information extracting method, device and terminal

A technique for extracting web page information and text, applied in the electronics field. It addresses problems such as degraded retrieval results and wasted user reading time, and improves extraction speed.

Active Publication Date: 2015-01-07
GUANGZHOU KINGSOFT NETWORK TECH

AI Technical Summary

Problems solved by technology

[0003] Web page information includes text content, advertisement information, email login information, etc., and the text content is generally in the middle of the web page display interface. In the existing technical solution, the crawler searches the entire web page information every time to extract use



Examples


Example Embodiment

[0073] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

[0074] Refer to Figure 1, which is a flow chart of the first embodiment of the web page information extraction method proposed by the present invention. As shown in the figure, the information extraction method in this embodiment includes:

[0075] S101. Parse the webpage information to obtain a tag tree of the webpage information, where the tag tree includes a plurality of nodes, and each node of the tag tre...



Abstract

An embodiment of the invention discloses a webpage information extraction method. The method comprises: parsing webpage information to obtain a tag tree of the webpage information, wherein the tag tree comprises a plurality of nodes and each node corresponds to one content block of the webpage information; obtaining a pre-established webpage information word library, wherein the word library comprises multiple types of word sets and each word in the word sets corresponds to one weight; traversing the tag tree according to the word library to obtain the text content block of the webpage information; and extracting at least one content element of the webpage information from the text content block. Embodiments of the invention also disclose a webpage information extraction device and terminal. The method, device and terminal can increase the speed of webpage information extraction.
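The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `TagTreeBuilder`, `WORD_LIBRARY`, `score`, `find_text_block` and `extract_elements` are hypothetical, the word sets and weights are invented for the example, and the scoring rule (sum the weights of the words under a node and take the highest-scoring node) is an assumption, since the text shown here does not say how the weights are combined. The choice of title and paragraphs as the extracted content elements is likewise an assumption.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the tag tree; '#text' nodes carry the text itself."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []

    def subtree_text(self):
        if self.tag == "#text":
            return self.text
        parts = (child.subtree_text() for child in self.children)
        return " ".join(p for p in parts if p)

    def walk(self):
        # Depth-first, pre-order traversal of the tag tree.
        yield self
        for child in self.children:
            yield from child.walk()


class TagTreeBuilder(HTMLParser):
    """Step 1: parse the page and build the tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.stack[-1].children.append(Node("#text", text))


# Step 2: a word library with several word sets, one weight per word.
# The categories, words and weights below are invented for illustration.
WORD_LIBRARY = {
    "body_text": {"article": 2.0, "explains": 1.0, "crawler": 1.0},
    "advertisement": {"sponsored": -3.0, "advertisement": -3.0, "buy": -2.0},
    "login": {"login": -3.0, "register": -2.0, "password": -3.0},
}


def score(node, library):
    """Assumed rule: sum the weights of all library words under the node."""
    words = re.findall(r"[a-z]+", node.subtree_text().lower())
    return sum(ws.get(w, 0.0) for w in words for ws in library.values())


def find_text_block(root, library):
    """Step 3: traverse the tree and keep the highest-scoring element."""
    elements = [n for n in root.walk() if n.tag != "#text"]
    # max() keeps the first maximum, so the outermost node wins a tie.
    return max(elements, key=lambda n: score(n, library))


def extract_elements(block):
    """Step 4: pull content elements (here: title and paragraphs)."""
    title = next((n.subtree_text() for n in block.walk() if n.tag == "h1"), None)
    paragraphs = [n.subtree_text() for n in block.walk() if n.tag == "p"]
    return {"title": title, "paragraphs": paragraphs}


def extract(html, library=WORD_LIBRARY):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return extract_elements(find_text_block(builder.root, library))


SAMPLE_HTML = """
<html><body>
<div id="nav"><a>login</a> <a>register</a> password</div>
<div id="main"><h1>Tag trees for crawlers</h1>
<p>This article explains how a crawler can locate the article body.</p></div>
<div id="ad">sponsored advertisement: buy now</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

In this sketch the negative weights on advertisement and login words pull the scores of the navigation and advertisement blocks, and of any ancestor that contains them, below the score of the main block, so only the main block's title and paragraph are returned.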

Description

Technical field

[0001] The present invention relates to the field of electronic technology, and in particular to a web page information extraction method, device and terminal.

Background

[0002] A search engine includes a crawler, an indexer and a retriever. The crawler collects information on the Internet and writes the collected information into a database. The indexer extracts index items from the information collected by the crawler to generate an index table for the document library. The retriever uses the index table of the document library to find the documents related to the query information submitted by the user, and displays those documents to the user. Whether the search engine can finally show the user a satisfactory search result therefore depends largely on the information extracted by the crawler, and the crawler's extraction method determines what information it extracts.

[0003] Web page information includes...


Application Information

IPC(8): G06F17/30
CPC: G06F16/95
Inventor: 邝锐强
Owner: GUANGZHOU KINGSOFT NETWORK TECH