Article analysis method based on DOM tree path matching

DOM tree path matching, applied in the field of article parsing, addresses the inconvenience, added difficulty, and reduced work efficiency that inconsistent article formats cause for statisticians. It facilitates information statistics and improves article parsing capability.

Status: Inactive; Publication Date: 2021-03-12
清创网御(合肥)科技有限公司

AI Technical Summary

Problems solved by technology

[0002] Articles on different websites differ in structure and format, which makes data statistics inconvenient for statisticians. At present, articles retrieved from different websites cannot be converted into a unified format. The variety of article formats greatly increases the difficulty of the statisticians' work and reduces their efficiency.



Examples


Embodiment Construction

[0016] The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments. The embodiments of the present invention have been presented for purposes of illustration and description, but are not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and changes will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to better explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention and design various embodiments with various modifications as are suited to the particular use.

[0017] The system template library stores specific parsing templates for different websites. The article parsing method based on DOM tree path matching obtains the URL of the page containing the article to be parsed and extracts its second-level dom...


Abstract

The invention discloses an article parsing method based on DOM tree path matching. The method comprises the following steps: storing specific parsing templates for different websites in a system template library; loading an article from a page through a webpage URL (Uniform Resource Locator) of a website W, and parsing the article into a DOM (Document Object Model) tree according to its hierarchical tags to obtain each node in the article and its path; matching the paths of the nodes in the specific parsing template of website W against each path of the DOM tree to obtain the node content corresponding to each successfully matched DOM tree path; matching the regular expression for node content in the specific parsing template against the node content of each successfully matched DOM tree path, and storing the successfully matched node content in a parsing result; and packaging and storing the parsing result in a unified format. Specific parsing templates are configured for different websites, and articles on those websites are parsed into a uniform format through the templates, which facilitates information statistics for statisticians.
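The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the template layout (a field name mapped to a `path` and a `pattern`), the slash-separated path notation, and the names `PathCollector`, `parse_article`, `SAMPLE_HTML`, and `SAMPLE_TEMPLATE` are all hypothetical. Only the standard-library `html.parser` and `re` modules are used.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects a (path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def parse_article(html, template):
    """Apply one site's template (assumed layout) to a page.

    Every field of the template appears in the result, so pages from
    different sites yield the same keys; unmatched fields are None.
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    result = {field: None for field in template}
    for field, rule in template.items():
        for path, text in collector.nodes:
            if path != rule["path"]:
                continue  # path matching step
            match = re.search(rule["pattern"], text)  # content regex step
            if match:
                result[field] = match.group(0)
                break
    return result


# Hypothetical page and template, for illustration only.
SAMPLE_HTML = """
<html><body>
  <div><h1>DOM parsing in practice</h1>
  <span>Published 2021-03-12 by staff</span>
  <p>Body text of the article.</p></div>
</body></html>
"""

SAMPLE_TEMPLATE = {
    "title": {"path": "/html/body/div/h1", "pattern": r".+"},
    "date": {"path": "/html/body/div/span", "pattern": r"\d{4}-\d{2}-\d{2}"},
    "content": {"path": "/html/body/div/p", "pattern": r".+"},
}

if __name__ == "__main__":
    print(parse_article(SAMPLE_HTML, SAMPLE_TEMPLATE))
```

In this sketch a path must match exactly before the regular expression is tried, mirroring the two-stage matching the abstract describes. How the real templates express paths and how they are stored in the template library is not specified in the text available here.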

Description

Technical field

[0001] The present invention relates to the technical field of article parsing, and more specifically to an article parsing method based on DOM tree path matching.

Background technique

[0002] Articles on different websites differ in structure and format, which makes data statistics inconvenient for statisticians. At present, articles retrieved from different websites cannot be converted into a unified format. The variety of article formats greatly increases the difficulty of the statisticians' work and reduces their efficiency.

Contents of the invention

[0003] The purpose of the present invention is to provide an article parsing method based on DOM tree path matching that configures specific parsing templates for different websites and parses articles on those websites into a unified format through the templates, facilitating information statistics for statisticians, so as to solve the technical prob...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/955, G06F16/958
CPC: G06F16/9566, G06F16/986
Inventor: 庞文俊陈继张长志黄星廖开枫李小超伊晓强
Owner: 清创网御(合肥)科技有限公司