
Self-adaptive data extraction method for web pages with structural changes

A data extraction technology for web pages whose structure changes, applied in network data indexing, network data retrieval, and electronic digital data processing. It addresses two problems: automatically generated extraction methods become meaningless once the page structure changes, and automated web page extraction methods lack risk resistance. The effect is improved risk resistance and high stability.

Status: Inactive | Publication Date: 2019-08-02
重庆紫光华山智安科技有限公司

AI Technical Summary

Problems solved by technology

[0003] Although automatic web page data extraction methods are mature, no existing solution specifically addresses their risk resistance and stability. If the data structure of a given type of web page changes during extraction, the previously generated extraction method becomes meaningless. A new technical means is therefore needed to improve the risk resistance and stability of automated web page extraction.



Embodiment Construction

[0030] Embodiments of the present invention are described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific implementation modes, and various modifications or changes can be made to the details in this specification based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the case of no conflict, the following embodiments and features in the embodiments can be combined with each other.

[0031] It should be noted that the diagrams provided in the following embodiments only schematically illustrate the basic ideas of the present invention, and show only the components related to the present invention rather than the number, shape, and size of the compo...


Abstract

The invention provides a self-adaptive data extraction method for web pages with structural changes. The method comprises the following steps: acquiring the data extracted from a web page and judging whether that data is abnormal; when it is abnormal, selecting several web pages that were extracted before the abnormality and obtaining their extracted content; deriving the core semantics of that extracted content; comparing the core semantics against the information of each node in the current (post-abnormality) web page to find the best-matching node; repeating these steps until the similarity judgment has been completed for every selected page and its corresponding current page; and, from the similarity results, determining the position in the current page of every piece of extracted content, deriving a new acquisition path, and completing the self-adaptive extraction. The method improves the risk resistance of automated web page extraction, achieves higher stability, and can cope with complex and changeable network environments.
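The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "core semantics" is approximated by a bag of word tokens, similarity by Jaccard overlap, an "abnormal" result by an empty extraction, and the new acquisition path is chosen by majority vote across the sampled pages. All function names, paths, and sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import re

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NodeCollector(HTMLParser):
    """Collects (path, text) for every text-bearing node of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open tags with sibling index, e.g. "div[2]"
        self.counts = [{}]  # per-depth tag counters used for sibling indexes
        self.nodes = []     # list of (path, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def parse_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def extract(html, path):
    """Return the text found at an acquisition path, or None if absent."""
    for node_path, text in parse_nodes(html):
        if node_path == path:
            return text
    return None


def is_abnormal(value):
    # Assumption: an empty or missing extraction signals a structure change.
    return value is None or not value.strip()


def tokens(text):
    # Assumption: a bag of lowercase word tokens stands in for core semantics.
    return set(re.findall(r"\w+", text.lower()))


def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def relearn_path(samples):
    """samples: (previously extracted text, current HTML of the same page)."""
    votes = Counter()
    for old_text, html in samples:
        best = max(parse_nodes(html),
                   key=lambda node: similarity(old_text, node[1]),
                   default=None)
        if best is not None and similarity(old_text, best[1]) > 0:
            votes[best[0]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical data: the title used to live under div/h1, now under section/h2.
OLD_PATH = "/html[1]/body[1]/div[1]/h1[1]"
NEW_PAGE_A = ("<html><body><div><span>Home</span></div><section>"
              "<h2>Acme Wireless Mouse</h2><p>In stock</p></section></body></html>")
NEW_PAGE_B = ("<html><body><div><span>Home</span></div><section>"
              "<h2>Zenith USB Keyboard</h2><p>In stock</p></section></body></html>")
NEW_PAGE_C = ("<html><body><div><span>Home</span></div><section>"
              "<h2>Orbit Webcam</h2><p>In stock</p></section></body></html>")
SAMPLES = [("Acme Wireless Mouse", NEW_PAGE_A),
           ("Zenith USB Keyboard", NEW_PAGE_B)]

if __name__ == "__main__":
    if is_abnormal(extract(NEW_PAGE_A, OLD_PATH)):
        new_path = relearn_path(SAMPLES)
        print(new_path, "->", extract(NEW_PAGE_C, new_path))
```

Voting across several previously extracted pages, rather than trusting a single match, is one way to read the abstract's "repeat until all selected pages are judged" step: a path that wins on most samples is less likely to be a coincidental text match.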

Description

Technical field

[0001] The invention relates to the field of computer applications, in particular to an adaptive data extraction method for web pages with structural changes.

Background technique

[0002] The data structure of a web page is generally implicit in the page itself, and web pages that share the same structure contain large amounts of data of the same class. The prior art already offers many web page data extraction methods. The simplest is to write a specific template for each type of website, but the workload of this approach is huge. To reduce that workload, in 2010 Nie Tiezheng and two others from Northeastern University published a web page data extraction method based on Extensible Markup Language (XML) queries, which reduced the extraction workload and improved automation. In 2014, Huang Yihua of Nanjing University and others published a full-scale web information extraction integration method. This method ...


Application Information

IPC(8): G06F16/951, G06F17/27
CPC: G06F16/951, G06F40/289, G06F40/30
Inventor 杨杰
Owner 重庆紫光华山智安科技有限公司