Heterogeneous-webpage-oriented data collection and labeling method

A data collection and labeling technology for web pages, applied to network data indexing, network data retrieval, data mining, and related fields, that achieves high-quality collection, accurate data classification, and improved work efficiency.

Inactive Publication Date: 2017-01-04
EAST CHINA NORMAL UNIV
3 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a heterogeneous-webpage-oriented data collection and labeling method that addresses the problems of the prior art by integrating data collection and classification, so that subsequent work can proceed smoothly.

Examples

Embodiment Construction

[0027] The present invention is further described below with reference to the accompanying drawings and embodiments, but it is not limited to the scope of the described embodiments. This embodiment provides a heterogeneous-webpage-oriented data collection and labeling method, taking educational and non-educational data as an example. As shown in Figure 1, the method includes the following steps:

[0028] In the data collection stage of this embodiment, Scrapy, a Python-based web crawler framework, is used to collect educational and non-educational data from the Internet. Scrapy is a fast, high-level screen-scraping and web-crawling framework for crawling websites and extracting structured data from their pages. With this framework, crawler applications can be developed simply and quickly, and the extracted data can be stored in a database for purposes such as data mining and information processing.

[0029] In the data labeling stage of this example, the...

Abstract

The invention discloses a heterogeneous-webpage-oriented data collection and labeling method. The method comprises the following steps: parsing the collected data with XPath and regular expressions; determining the page structure of a webpage by querying whether a corresponding DOM node exists; and using the result as prior knowledge for the subsequent labeling work, so that labeling can be completed simply by correcting the pre-labeled data. The method effectively integrates data collection and data classification, so that data mining can proceed efficiently. Compared with conventional data collection methods, the method achieves higher collection quality, a lower rate of junk data, and accurate data classification.
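The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the rule table, the class names `course-info` and `news-body`, the date pattern, and the function name `pre_label` are all hypothetical, and the page is assumed to be well-formed XML so that the standard-library parser can read it (real HTML would usually need a more tolerant parser).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: a structure name, an XPath whose presence signals that
# page structure, and the provisional label to attach when it matches.
STRUCTURE_RULES = [
    ("course_page", ".//div[@class='course-info']", "educational"),
    ("news_page", ".//div[@class='news-body']", "non-educational"),
]

# Hypothetical regular expression for a field extracted from the node text.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def pre_label(page_source):
    """Return a pre-labeled record for one well-formed page."""
    root = ET.fromstring(page_source)
    for structure, xpath, label in STRUCTURE_RULES:
        node = root.find(xpath)
        if node is None:
            continue  # DOM node absent, so this page structure does not apply
        # Collapse whitespace in the text content of the matched node.
        text = " ".join("".join(node.itertext()).split())
        match = DATE_RE.search(text)
        return {
            "structure": structure,
            "pre_label": label,
            "text": text,
            "date": match.group(0) if match else None,
        }
    # No rule matched: leave the record unlabeled for the human annotator.
    return {"structure": None, "pre_label": None, "text": "", "date": None}
```

In this sketch the pre-label is only a starting point: a human annotator would review each record and correct the `pre_label` field where the rule guessed wrong.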

Description

Technical field

[0001] The invention relates to the technical field of Internet information retrieval and mining, and in particular to a data collection and labeling method for heterogeneous webpages.

Background

[0002] A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. A crawler generally resides on a server. Starting from some given URLs (Uniform Resource Locators), it reads the corresponding documents using HTTP (Hypertext Transfer Protocol) and other standard protocols, then takes all unvisited URLs contained in each document as new starting points and continues to roam until no new URL meets the conditions. The Internet carries a large amount of information, and extracting and using that information effectively has become a huge challenge.

[0003] A common practice of traditional web crawler tec...
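The crawl loop described in paragraph [0002] can be sketched as a breadth-first traversal. This is a simplified illustration, not the method claimed by the patent: `crawl`, `fetch_links`, `is_allowed`, and the `LINKS` graph are hypothetical names, and an in-memory dictionary stands in for downloading pages over HTTP.

```python
from collections import deque


def crawl(seed_urls, fetch_links, is_allowed=lambda url: True):
    """Breadth-first crawl; returns URLs in the order they were visited.

    fetch_links is a hypothetical callable that maps a URL to the URLs found
    in that document; a real crawler would download the page over HTTP here.
    is_allowed is a hypothetical filter for "URLs that meet the conditions".
    """
    frontier = deque(seed_urls)
    visited = set()
    order = []
    while frontier:  # stop when no new URL remains
        url = frontier.popleft()
        if url in visited or not is_allowed(url):
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)  # unvisited URLs become new starting points
    return order


# Toy in-memory link graph standing in for the web (illustrative URLs only).
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
}

if __name__ == "__main__":
    print(crawl(["http://example.com/"], lambda u: LINKS.get(u, [])))
```

The `visited` set is what keeps the loop from revisiting pages that link back to each other, and the `is_allowed` filter is where a focused crawler would restrict itself to URLs of interest.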

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951, G06F2216/03
Inventors: 孙仕亮, 陈俊宇
Owner: EAST CHINA NORMAL UNIV