Template building method, page content grasping method and device, medium and equipment

A page content capture technology, applied in special data processing applications, instruments, network data retrieval, and similar fields, that addresses problems such as the increased workload of technicians and the high demands placed on their technical skill and experience.

Active Publication Date: 2018-07-06
NEUSOFT CORP

AI Technical Summary

Problems solved by technology

In the prior art, when crawler technology is used to obtain information on the Internet, it is necessary to manually configure the crawler template to determine the path to grab each informat

Examples

Embodiment Construction

[0083] Specific embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present disclosure, and are not intended to limit the present disclosure.

[0084] FIG. 1 is a flow chart of a template construction method for page content capture according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:

[0085] In S11, HTML source codes of multiple pages are acquired.

[0086] On a typical website, there can be multiple sections under the same site, and multiple subsections under each section. For example, under NetEase's website (http://www.163.com/), the URL of the sports section is http://sports.163.com/, the URL of the technology section is http://tech.163.com/, and the URL of its sub-section IT...

Abstract

The invention relates to a template construction method, a page content capture method and device, a medium, and equipment. The method comprises: acquiring the HTML source code of a plurality of pages; generating a DOM tree for each page from that page's HTML source code; extending the bottom-layer child nodes of each DOM tree into child nodes with text identifiers; and generating a page structure standard template from the DOM trees after the extension operation. A unified page structure standard template is thereby built for pages with similar structures, which allows massive amounts of page content to be captured quickly and accurately while avoiding manual configuration by technical staff, reducing workload and improving work efficiency. The accuracy and applicable range of the page structure standard template can also be improved, the demands on the skill and experience of technical staff are reduced, and the user experience is improved.
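The four steps in the abstract can be sketched in Python. This is only one possible reading, not the patented procedure: the abstract does not say how bottom-layer nodes are extended or how the DOM trees are combined. The sketch therefore assumes that each leaf element holding text receives a `#text` child, and that the template is the set of root-to-leaf tag paths shared by every page. The names `Node`, `DomBuilder`, `extend_leaves`, `leaf_paths`, and `build_template` are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: a tag name, child nodes, and any direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source (attributes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extend_leaves(node):
    """Assumed reading of the extension step: tag text-bearing leaves with '#text'."""
    for child in list(node.children):
        extend_leaves(child)
    if not node.children and node.text:
        node.children.append(Node("#text", node))


def leaf_paths(node, prefix=""):
    """Return the set of root-to-leaf tag paths in a tree."""
    path = f"{prefix}/{node.tag}"
    if not node.children:
        return {path}
    found = set()
    for child in node.children:
        found |= leaf_paths(child, path)
    return found


def build_template(pages):
    """Assumed merge rule: keep only the leaf paths present in every page."""
    trees = [build_dom(page) for page in pages]
    for tree in trees:
        extend_leaves(tree)
    return set.intersection(*(leaf_paths(tree) for tree in trees))


page_a = "<html><body><h1>Title A</h1><div><p>Body A</p></div><span>ad</span></body></html>"
page_b = "<html><body><h1>Title B</h1><div><p>Body B</p></div></body></html>"
template = build_template([page_a, page_b])
for path in sorted(template):
    print(path)
```

With these two sample pages, the `span` that appears on only one page drops out, and the shared heading and paragraph paths remain as the template. Intersection is the simplest merge rule. A more tolerant rule, such as keeping paths found on most pages, would fit the abstract equally well.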

Description

Technical field

[0001] The present disclosure relates to the field of information acquisition, and in particular to a template construction method, a page content capture method and device, media, and equipment.

Background

[0002] With the development of the Internet, the amount of information it carries keeps increasing. Crawler technology emerged to make that information easier to acquire. When information is acquired from the Internet through crawler technology, XPath is generally used to select paths within the HTML once the HTML source code of a webpage has been obtained. In the prior art, the crawler template must be configured manually to determine the path from which each piece of information is captured, which not only increases the workload of the technicians, but also has a great impact on the technical skills of the...
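To make the manual step concrete, a hand-configured crawler template can be pictured as a table of XPath expressions, one per field, written by a technician for one particular page layout. The following is a minimal sketch under stated assumptions: the sample markup, the field names, and the `extract` helper are hypothetical, and Python's `xml.etree.ElementTree` is used only because it is in the standard library. It supports a limited XPath subset and requires well-formed markup, so a real crawler would typically use an HTML-tolerant parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for fetched HTML source.
PAGE = """<html><body>
<div class="post"><h1>Sample headline</h1>
<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div></div>
</body></html>"""

# A manually configured template: each field maps to a hand-written XPath.
# Every new page layout needs its own table like this one.
MANUAL_TEMPLATE = {
    "title": "./body/div[@class='post']/h1",
    "paragraphs": "./body/div[@class='post']/div[@class='content']/p",
}


def extract(page_source, template):
    """Apply each XPath in the template and collect the matched elements' text."""
    root = ET.fromstring(page_source)
    return {
        field: [element.text for element in root.findall(xpath)]
        for field, xpath in template.items()
    }


result = extract(PAGE, MANUAL_TEMPLATE)
print(result)
```

Each expression is tied to the exact nesting and class names of one layout, so a small change to the page structure silently breaks a field. That maintenance burden is what the background section describes.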

Application Information

IPC(8): G06F17/30
CPC: G06F16/9027, G06F16/951, G06F16/955
Inventor: 刘嘉伟, 崔朝辉, 赵立军, 张霞
Owner: NEUSOFT CORP