Web page data acquisition method using contextual environment rules

A web page data acquisition method in the technical field of electronic digital data processing and special data processing applications. It addresses problems such as irrelevant page content hampering users' reading, and aims at high content extraction quality, efficient rule writing, and simple rule definition.

Inactive Publication Date: 2018-07-06
孙翔

AI Technical Summary

Problems solved by technology

In these applications, even a very small amount of unfiltered irrelevant information in a web page makes reading harder for users.
Products dedicated to extracting the core content of web pages already exist, such as Lixto, Kapowtech, and Mozenda. They use different extraction strategies: some use the DOM tree method, some the visual text block method, and some the density method. Each method suits different situations, and a single method alone may not achieve an ideal extraction result on a specific page. It is therefore important to design a tool that integrates these extraction technologies and can determine which technique should be applied to each part of a web page.



Embodiment Construction

[0010] The technical solutions of the present invention will be further described in detail below in conjunction with specific embodiments.

[0011] A web page data acquisition method using contextual environment rules comprises content extraction rules and a rule matching algorithm. The content extraction rules are mainly defined by the user according to the extraction rule syntax. They adopt a tree-shaped inheritance structure similar to that of object-oriented languages: specific, specialized extraction rules inherit from general rules. The extraction rule syntax follows a condition-action pattern. The condition part includes DOM node attributes and context attributes. The DOM node attributes include the tag name, node class name, node ID, node font name, node width, node height, and even values calculated inside the DOM node, such as the number of pictures it contains, the number of strings, text length, link de...
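The rule structure described in [0011] can be illustrated with a minimal Python sketch. It is not the patent's rule syntax, which the text does not show. The names `Node`, `Rule`, `effective_conditions`, the `inside_main` context attribute, and the 200-character threshold are all hypothetical, chosen only to show how a specialized rule might inherit conditions from a general rule and test both DOM node attributes and context attributes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical, simplified view of a DOM node's attributes."""
    tag: str
    class_name: str = ""
    node_id: str = ""
    text: str = ""
    image_count: int = 0


# A condition inspects the node and the current context attributes.
Condition = Callable[[Node, Dict[str, object]], bool]


@dataclass
class Rule:
    """Condition-action rule; a child rule inherits its parent's conditions."""
    name: str
    conditions: List[Condition]
    action: str
    parent: Optional["Rule"] = None

    def effective_conditions(self) -> List[Condition]:
        # Walk up the inheritance tree: general conditions come first.
        inherited = self.parent.effective_conditions() if self.parent else []
        return inherited + self.conditions

    def matches(self, node: Node, context: Dict[str, object]) -> bool:
        return all(cond(node, context) for cond in self.effective_conditions())


# General rule: any block-level text container is a candidate.
general = Rule(
    name="text-block",
    conditions=[lambda n, ctx: n.tag in {"div", "p", "article"}],
    action="classify:candidate",
)

# Specialized rule: adds a DOM-attribute test (text length, assumed threshold)
# and a context-attribute test (assumed flag set by an ancestor).
article_body = Rule(
    name="article-body",
    conditions=[
        lambda n, ctx: len(n.text) >= 200,
        lambda n, ctx: ctx.get("inside_main", False) is True,
    ],
    action="classify:core-content",
    parent=general,
)
```

In this sketch, inheritance simply concatenates conditions, so a specialized rule can only narrow what its parent matches. Whether the patented method also lets child rules override parent conditions or actions is not stated in the visible text.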


Abstract

The invention discloses a web page data acquisition method using contextual environment rules. The method includes content extraction rules and a rule matching algorithm. The content extraction rules are mainly defined by the user according to the extraction rule syntax and are organized in a tree-shaped inheritance structure. The rule syntax follows a condition-action pattern. The condition part includes DOM node attributes and context attributes. The action part includes classifying the nodes that match the conditions, updating the context attributes, and applying a specified content extraction technique. The method combines the main data extraction techniques from the field of data mining to achieve more precise web page data extraction. According to the disclosure, the rule syntax is simple to define, easy to learn and use, and efficient to write; the rule matching conditions allow different extraction methods to be applied precisely to different parts of the same page; and the content extraction quality is stated to be higher than that of existing products of the same type.
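The three kinds of action named in the abstract (classifying a matching node, updating context attributes, and applying a specified extraction technique) can be sketched as a small rule matching pass over a DOM-like tree. This is only an illustration under assumptions: the disclosure does not give the matching algorithm, so the traversal order, the scoping of context updates to a node's subtree, the decision to stop descending after extraction, and every name here (`Node`, `Rule`, `EXTRACTORS`, `apply_rules`, `extract`, `in_main`, `in_nav`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical minimal DOM node."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    label: str = ""


Context = Dict[str, object]


def full_text(node: Node) -> str:
    """Concatenate the node's own text with all descendant text."""
    return node.text + "".join(full_text(c) for c in node.children)


# Stand-ins for real extraction techniques (DOM tree, visual block, density).
EXTRACTORS: Dict[str, Callable[[Node], str]] = {
    "dom_text": lambda node: full_text(node).strip(),
    "first_sentence": lambda node: full_text(node).split(".")[0].strip(),
}


@dataclass
class Rule:
    condition: Callable[[Node, Context], bool]
    classify: str = ""                      # action 1: label the node
    context_update: Dict[str, object] = field(default_factory=dict)  # action 2
    extractor: str = ""                     # action 3: technique to apply


def apply_rules(node: Node, rules: List[Rule], context: Context, out: List[str]) -> None:
    # Copy the context so updates affect this subtree only (an assumption).
    context = dict(context)
    for rule in rules:
        if rule.condition(node, context):
            if rule.classify:
                node.label = rule.classify
            context.update(rule.context_update)
            if rule.extractor:
                out.append(EXTRACTORS[rule.extractor](node))
                return  # assumed: extracted nodes are not traversed further
    for child in node.children:
        apply_rules(child, rules, context, out)


RULES = [
    Rule(lambda n, ctx: n.tag == "nav", classify="navigation",
         context_update={"in_nav": True}),
    Rule(lambda n, ctx: n.tag == "article", classify="main",
         context_update={"in_main": True}),
    Rule(lambda n, ctx: n.tag == "p" and ctx.get("in_main") is True,
         classify="core", extractor="dom_text"),
]


def extract(page: Node) -> List[str]:
    out: List[str] = []
    apply_rules(page, RULES, {}, out)
    return out


page = Node("body", children=[
    Node("nav", children=[Node("p", text="Home | About")]),
    Node("article", children=[Node("p", text=" Core text. "),
                              Node("p", text="More text.")]),
])
print(extract(page))
```

The paragraph inside `nav` is skipped because the `in_main` context attribute is never set on that branch, which is the kind of context-dependent decision the abstract attributes to the rule matching conditions.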

Description

Technical field

[0001] The invention relates to the field of data mining, in particular to a web page data acquisition method using contextual environment rules.

Background

[0002] Web page content acquisition is a complex process. It includes determining which part of the page contains the core text content and ignoring content that is not relevant to the topic, such as headers, footers, navigation bars, and advertisements. Among these steps, the most critical is identifying the core text content. Recognizing core text has a wide range of applications, such as generating text indexes, generating web page summaries, providing web page reading functions for users with visual impairments, and providing optimized web content for smart devices with small screens. In these applications, even a very small amount of unfiltered irrelevant information in a web page makes reading harder for users. At present, there have been product...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/24564, G06F16/24575, G06F16/2465, G06F16/9535
Inventor: 孙翔
Owner: 孙翔