Heterogeneous-webpage-oriented data collection and labeling method

A data collection technology for web pages, applied to network data indexing, network data retrieval, and data mining, intended to achieve high-quality collection, accurate data classification, and higher work efficiency.

Status: Inactive. Publication Date: 2017-01-04

EAST CHINA NORMAL UNIV

3 Cites, 9 Cited by


Problems solved by technology

[0005] The purpose of the present invention is to address the problems in the prior art by providing a heterogeneous-webpage-oriented data collection and labeling method that integrates data collection and classification, so that subsequent work can proceed smoothly.



Embodiment Construction

[0027] The present invention is further described below with reference to the accompanying drawings and embodiments, but the invention is not limited to the scope of the described embodiments. This embodiment provides a heterogeneous-webpage-oriented data collection and labeling method, taking educational and non-educational data as an example. As shown in Figure 1, it includes the following steps:

[0028] In the data collection stage of this embodiment, Scrapy, a Python-based web crawler framework, is used to collect educational and non-educational data from the Internet. Scrapy is a fast, high-level screen-scraping and web-crawling framework for crawling websites and extracting structured data from pages. With this framework, crawler applications can be developed simply and quickly, and the extracted data can be stored in a database for purposes such as data mining and information processing.

[0029] In the data labeling stage of this example, the...


Abstract

The invention discloses a heterogeneous-webpage-oriented data collection and labeling method. The method comprises the following steps: parsing the collected data using XPath and regular expressions; determining the page structure of a webpage by checking whether a corresponding DOM node exists; and providing prior knowledge for the subsequent labeling work, so that labeling can be completed merely by correcting the pre-labeled data. The method effectively integrates data collection and data classification, so that data mining can proceed efficiently. Compared with conventional data collection methods, it offers higher collection quality, a lower rate of junk data collection, and accurate data classification.
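The steps named in the abstract (XPath and regular-expression parsing, a DOM-node existence check to identify the page structure, and a pre-label handed to the annotator) might look roughly like the sketch below. It is only an illustration, not the patented implementation: it uses Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed XHTML input, and the template XPaths, class names, labels, date pattern, and sample pages are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page templates: each pairs an XPath probe with a pre-label.
# The class names and labels are invented for illustration only.
TEMPLATES = [
    (".//div[@class='course-outline']", "educational"),
    (".//div[@class='product-price']", "non-educational"),
]

# Hypothetical regular expression that pulls an ISO-style date out of page text.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def pre_label(xhtml: str) -> dict:
    """Guess the page structure from DOM nodes and attach a pre-label."""
    root = ET.fromstring(xhtml)  # assumes well-formed XHTML
    label = "unknown"
    for xpath, candidate in TEMPLATES:
        # The page structure is judged by whether the probe node exists.
        if root.find(xpath) is not None:
            label = candidate
            break
    title = root.findtext(".//title", default="").strip()
    text = " ".join(root.itertext())
    match = DATE_RE.search(text)
    return {
        "title": title,
        "date": match.group(0) if match else None,
        "pre_label": label,     # prior knowledge for the annotator
        "needs_review": True,   # a human still confirms or corrects it
    }


EDU_PAGE = """<html><head><title>Intro to Algebra</title></head>
<body><div class="course-outline">Updated 2016-09-01</div></body></html>"""

SHOP_PAGE = """<html><head><title>Desk Lamp</title></head>
<body><div class="product-price">19.99</div></body></html>"""

if __name__ == "__main__":
    for page in (EDU_PAGE, SHOP_PAGE):
        print(pre_label(page))
```

Pages that match no template keep the label `unknown`, so the annotator can see at a glance which records received no prior knowledge and must be labeled from scratch.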

Description

Technical field

[0001] The invention relates to the technical field of Internet information retrieval and mining, in particular to a data collection and labeling method for heterogeneous webpages.

Background technique

[0002] A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. A web crawler generally resides on a server. Starting from some given URLs (Uniform Resource Locators), it reads the corresponding documents using HTTP (Hypertext Transfer Protocol) and other standard protocols, then takes all unvisited URLs contained in each document as new starting points and continues roaming until no new URL meets the conditions. The Internet carries a large amount of information, and how to extract and use this information effectively has become a huge challenge.

[0003] A common practice of traditional web crawler tec...
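The crawl loop described in the background (start from given URLs, read each document, follow unvisited URLs until none that meets the conditions remains) can be sketched as a breadth-first traversal. This is a generic illustration using only the Python standard library, not the method of the invention. The `fetch` callable, the same-host condition, the `limit` parameter, and the in-memory `FAKE_WEB` site are assumptions made so the example runs without network access; a real crawler would fetch pages over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed_host, limit=100):
    """Breadth-first crawl from seed URLs until no new matching URL remains."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)  # returns the page source, or None on failure
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # "Meets the conditions" is modelled here as a same-host check.
            if urlparse(absolute).netloc == allowed_host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Stand-in for HTTP: a tiny in-memory site with hypothetical URLs.
FAKE_WEB = {
    "http://example.edu/": '<a href="/a">A</a><a href="http://other.example/x">X</a>',
    "http://example.edu/a": '<a href="/">home</a><a href="/b">B</a>',
    "http://example.edu/b": "<p>leaf</p>",
}

if __name__ == "__main__":
    print(crawl(["http://example.edu/"], FAKE_WEB.get, "example.edu"))
```

The `seen` set is what keeps the crawler from revisiting pages that link back to each other, and the host check is one simple way to express a stopping condition; other conditions (depth limits, URL patterns) would slot into the same place.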


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/9566, G06F16/951, G06F2216/03

Inventors: 孙仕亮, 陈俊宇

Owner: EAST CHINA NORMAL UNIV
