Webpage information extraction method and device based on http protocol

A webpage information extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses the difficulty of acquiring target information and achieves well-targeted extraction with improved efficiency and accuracy.

Status: Inactive
Publication date: 2014-09-17
北京思特奇信息技术股份有限公司
Cites: 3; Cited by: 29

AI Technical Summary

Problems solved by technology

[0003] The technical problem to be solved by the present invention is to provide an information extraction method and device based o



Examples


Embodiment Construction

[0026] The principles and features of the present invention are described below in conjunction with the accompanying drawings, and the examples given are only used to explain the present invention, and are not intended to limit the scope of the present invention.

[0027] As shown in Figure 1, this embodiment provides a method for extracting webpage information based on the HTTP protocol, including:

[0028] Template generation step: customize a page parsing template for the target page from which information is to be extracted, and predefine the target fields and verification rules in that template;
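The source does not say how a page parsing template is represented. As a minimal sketch, one might model it as a small data structure that pairs each target field with a locator and a verification rule. The names `FieldSpec`, `PageTemplate`, and `product_template`, the use of regular expressions as locators, and the example fields are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FieldSpec:
    """One target field: how to locate it and how to verify it (hypothetical)."""
    pattern: str                  # regex with one capture group (assumed locator style)
    check: Callable[[str], bool]  # verification rule applied to the extracted value


@dataclass
class PageTemplate:
    """A page parsing template customized for one kind of target page."""
    name: str
    fields: Dict[str, FieldSpec] = field(default_factory=dict)


# Hypothetical template for a product page with two target fields.
product_template = PageTemplate(
    name="product_page",
    fields={
        "title": FieldSpec(
            pattern=r"<h1[^>]*>(.*?)</h1>",
            check=lambda value: len(value.strip()) > 0,
        ),
        "price": FieldSpec(
            pattern=r'<span class="price">(.*?)</span>',
            check=lambda value: re.fullmatch(r"\d+(\.\d{2})?", value) is not None,
        ),
    },
)
```

Keeping the locator and the rule together in one entry means that supporting a new kind of target page only requires writing a new template, not new extraction code.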

[0029] Web page address parsing step: parsing the web page address of the target page to obtain the HTML source file of the target page;
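A sketch of what this step could look like using only the Python standard library: the address is validated as an http(s) URL, then the page is requested and decoded into HTML text. The function names `normalize_url` and `fetch_html`, the timeout value, and the UTF-8 fallback are assumptions, not details from the source.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def normalize_url(url: str) -> str:
    """Parse the address and reject anything that is not an http(s) URL."""
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) address: {url!r}")
    return parts.geturl()


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Request the target page over HTTP and return its HTML source as text."""
    with urlopen(normalize_url(url), timeout=timeout) as response:
        # Use the charset declared in the response headers; fall back to
        # UTF-8 (an assumption) when the server does not declare one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `fetch_html("http://example.com/")` would return the page source as a string, which the extraction step can then read.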

[0030] Information extraction step: read and parse the HTML source file of the target page, and extract the page information matching the predefined target field of the page parsing template from the HTML sourc...
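The matching mechanism is not specified in the visible text. The sketch below assumes each target field is located by a regular expression with one capture group, which is a simplification: a production extractor would more likely use a real HTML parser. The function `extract_fields`, the sample page, and the field names are hypothetical.

```python
import html
import re
from typing import Dict, Optional


def extract_fields(source: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the first match of each field's pattern, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, source, flags=re.DOTALL | re.IGNORECASE)
        # Decode HTML entities and trim whitespace around the captured text.
        result[name] = html.unescape(match.group(1)).strip() if match else None
    return result


# Hypothetical HTML source of a target page.
sample = """<html><body>
<h1> Blue Kettle </h1>
<span class="price">19.99</span>
</body></html>"""

# Hypothetical target fields taken from a page parsing template.
patterns = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "price": r'<span class="price">(.*?)</span>',
    "stock": r'<span class="stock">(.*?)</span>',
}

record = extract_fields(sample, patterns)
print(record)
```

Fields that are missing from the page come back as `None` rather than raising an error, so a later verification step can decide whether the record is acceptable.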


Abstract

The invention relates to a webpage information extraction method and device based on the HTTP protocol. The method comprises five steps: template generation, webpage address analysis, information extraction, information checking, and information storage. In the template generation step, a page analysis template is customized for the target page from which information is to be extracted, and the target fields and checking rules are predefined in the template. In the webpage address analysis step, the webpage address of the target page is analyzed to obtain the HTML source file of the target page. In the information extraction step, the HTML source file is read and analyzed, and the page information matching the target fields predefined in the template is extracted from it. In the information checking step, the extracted page information is checked against the predefined checking rules to determine whether it meets requirements. In the information storage step, the page information that has passed checking is stored. With this method and device, page information on the network is effectively filtered, acquired, and collected through the open HTTP protocol, and because templates are customized for different target pages, customized information extraction is achieved.
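The checking and storage steps can be sketched as follows, assuming the extracted page information arrives as a dictionary of field values. The rule set, the `pages` table layout, the function names, and the choice of an in-memory SQLite database are all assumptions for illustration; the abstract does not name a storage backend.

```python
import re
import sqlite3
from typing import Callable, Dict, Optional

Rules = Dict[str, Callable[[str], bool]]


def check_record(record: Dict[str, Optional[str]], rules: Rules) -> bool:
    """A record passes only if every ruled field is present and valid."""
    return all(
        record.get(name) is not None and rule(record[name])
        for name, rule in rules.items()
    )


def store_record(conn: sqlite3.Connection, record: Dict[str, Optional[str]]) -> None:
    """Persist one checked record (hypothetical two-column table)."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, price TEXT)")
    conn.execute(
        "INSERT INTO pages (title, price) VALUES (?, ?)",
        (record["title"], record["price"]),
    )
    conn.commit()


# Hypothetical checking rules predefined in a template.
rules: Rules = {
    "title": lambda value: bool(value.strip()),
    "price": lambda value: re.fullmatch(r"\d+(\.\d{2})?", value) is not None,
}

conn = sqlite3.connect(":memory:")
extracted = [
    {"title": "Blue Kettle", "price": "19.99"},  # passes both rules
    {"title": "Red Kettle", "price": "n/a"},     # fails the price rule
]
for record in extracted:
    if check_record(record, rules):
        store_record(conn, record)

stored = conn.execute("SELECT title, price FROM pages").fetchall()
print(stored)
```

Only records that pass every rule reach the database, which mirrors the order of the steps in the abstract: checking comes before storage.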

Description

Technical field

[0001] The invention relates to the field of information crawling and analysis in network technology, in particular to a method and device for extracting webpage information based on the http protocol.

Background technique

[0002] The Web 2.0 era is an era of information explosion. Massive data information is flooding in all aspects of work and life. Therefore, the demand for data-based analysis and potential value mining is becoming increasingly urgent. However, in reality, because the data owner has very strict control over the data, many valuable data information cannot be easily collected and extracted. In this context, the importance of data is highlighted, but the availability of data is not high, or even limited. Therefore, how to collect, extract and utilize the target data of concern based on the Internet characteristics of the data has become an urgent problem to be solved.

Contents of the invention

[0003] The technical problem to be solved b...


Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventors: 马春新, 董磊
Owner: 北京思特奇信息技术股份有限公司