Data collecting method and system based on HTML stream processing

A data acquisition and HTML document technology, applied in transmission systems, electrical digital data processing, special data processing applications, etc. It addresses the inability of existing methods to determine the download range or to store the information users need in a structured format, as well as their low efficiency. The claimed benefits are low operating cost, easy maintenance, and a wide range and variety of data that can be captured.

Status: Inactive. Publication date: 2010-10-13

FUDAN UNIV

Cites: 3. Cited by: 6.

Problems solved by technology

[0006] Methods of this type are usually simple in principle and fall roughly into two types. The first type is traversal download: the method exhaustively enumerates the links on a page and then follows each link to download more data. Its main disadvantages are that the scope of the download cannot be determined and that the information the user needs cannot be stored in a structured format.

The second type exploits the structure of the website itself. This solves the problems of the first type, but it applies mainly to downloading from specified websites: the program has to be written to mimic every website that needs to be downloaded, which makes the approach inefficient.
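The first type (traversal download) can be illustrated with a minimal sketch. It is not taken from the patent: the `traverse` function, the `LinkExtractor` class, and the in-memory `SITE` dictionary (standing in for real HTTP responses) are assumptions made for illustration. The sketch shows why the download scope is hard to bound: every discovered link is followed, including links that leave the starting site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def traverse(start_url, fetch, max_pages=100):
    """Breadth-first traversal; `fetch` maps a URL to HTML text or None."""
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical in-memory "site" used instead of network access.
SITE = {
    "http://example.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a><a href="http://other.test/">X</a>',
    "http://example.test/b": "<p>leaf</p>",
    "http://other.test/": "<p>off-site page</p>",
}

visited = traverse("http://example.test/", SITE.get)
```

Only the arbitrary `max_pages` cap stops the crawl, and nothing in the traversal says which parts of each page should be stored.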

Embodiment Construction

[0039] Take "http://www.youku.com/v" as an example source entry for the collected pages. First, configure the root entry of the download control template (see Figure 2), which is set in a fixed-format XML document.

[0040] The user then configures the download control template according to the data to be stored. This template tells the system which data to store. Taking the title of each video, the link of its page, its description, and its publication time as the content to be stored, configure the stored-data part of the download control template (see Figure 3, which shows part of the configured template and is an instance of the nodes introduced in Table 1).
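The figures and Table 1 are not reproduced here, so the actual template format is unknown. As a rough sketch only, a template of this kind might be read as follows. Every element and attribute name (`template`, `root`, `store`, `field`, `tag`, `class`, `source`) is hypothetical, and the entry URL is the normalized form of the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical template layout; the patent's real node names are in Table 1,
# which is not available in this text.
TEMPLATE_XML = """
<template>
  <root url="http://www.youku.com/v"/>
  <store>
    <field name="title" tag="a" class="video-title" source="text"/>
    <field name="link" tag="a" class="video-title" source="href"/>
    <field name="description" tag="p" class="video-desc" source="text"/>
    <field name="published" tag="span" class="video-time" source="text"/>
  </store>
</template>
"""


def load_template(xml_text):
    """Return the root entry URL and the list of fields to store."""
    root = ET.fromstring(xml_text)
    entry = root.find("root").get("url")
    fields = [dict(field.attrib) for field in root.iter("field")]
    return entry, fields


entry_url, fields = load_template(TEMPLATE_XML)
```

Keeping the entry point and the stored fields in one declarative document means a new site can be supported by editing configuration rather than code, which is the benefit the text attributes to the template.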

[0041] The parser reads the HTML stream together with the stored-data section of the download control template in Figure 3, extracts the data the user requires, and stores it in the data storage system. Figure 4 shows the result stored in the database.
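The patent text does not say how the parser is implemented. The following is a minimal sketch of the general idea, assuming Python's standard `html.parser` module, which accepts input incrementally. The `RULES` mapping, the `StreamExtractor` class, and the sample markup are all hypothetical, and the sketch assumes that matched elements are not nested.

```python
from html.parser import HTMLParser

# Hypothetical rules: field name -> (tag, class attribute, what to capture).
RULES = {
    "title": ("a", "video-title", "text"),
    "link": ("a", "video-title", "href"),
    "published": ("span", "video-time", "text"),
}


class StreamExtractor(HTMLParser):
    """Builds one record each time all configured fields have been seen."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.records = []
        self._row = {}
        self._capturing = []  # fields whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, (rule_tag, rule_class, source) in self.rules.items():
            if tag == rule_tag and attrs.get("class") == rule_class:
                if source == "text":
                    self._capturing.append(name)
                    self._row[name] = ""
                else:
                    self._row[name] = attrs.get(source, "")

    def handle_data(self, data):
        # Text may arrive in several pieces when the stream is chunked.
        for name in self._capturing:
            self._row[name] += data

    def handle_endtag(self, tag):
        if self._capturing:
            self._capturing = []  # assumes matched elements are not nested
            if len(self._row) == len(self.rules):
                self.records.append(self._row)
                self._row = {}


SAMPLE = (
    '<div><a class="video-title" href="/v/1">First clip</a>'
    '<span class="video-time">2010-01-01</span></div>'
    '<div><a class="video-title" href="/v/2">Second clip</a>'
    '<span class="video-time">2010-01-02</span></div>'
)

extractor = StreamExtractor(RULES)
for start in range(0, len(SAMPLE), 16):  # feed the page in small chunks
    extractor.feed(SAMPLE[start:start + 16])
extractor.close()
records = extractor.records
```

Because the parser consumes chunks as they arrive, the page never has to be held in memory as a full document tree before the configured fields are extracted.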

[0042] The...

Abstract

The invention belongs to the technical field of webpage information extraction, and particularly discloses a data collecting method and a data collecting system based on HTML stream processing. The data collecting system consists of a multi-thread collector, which guarantees working speed, a download control template, which guarantees working accuracy, and a data storage system. The system can collect the network data a user needs through a simple template configuration. Practical application shows that the invention has excellent stability, high practicability and high efficiency.
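How the three components fit together is not detailed in this excerpt. One plausible arrangement, sketched under stated assumptions, is a thread pool that downloads pages in parallel and a single writer that stores the extracted rows. The `PAGES` dictionary replaces real HTTP downloads, and the `collect` and `run` functions and the `video` table are hypothetical names, not part of the patent.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for downloaded pages: URL -> extracted title.
PAGES = {
    "http://example.test/v/1": "First clip",
    "http://example.test/v/2": "Second clip",
    "http://example.test/v/3": "Third clip",
}


def collect(url):
    """Worker: 'download' one page and return the row to store."""
    return url, PAGES[url]


def run(urls, workers=4):
    """Collect all URLs with a thread pool, then store the rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE video (link TEXT PRIMARY KEY, title TEXT)")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(collect, urls))
    # Writes happen on the calling thread only, so the connection
    # is never shared between worker threads.
    db.executemany("INSERT INTO video VALUES (?, ?)", rows)
    db.commit()
    return db


db = run(sorted(PAGES))
stored = db.execute("SELECT link, title FROM video ORDER BY link").fetchall()
```

Threads suit this workload because downloading is I/O-bound. Keeping storage on one thread is a simplification chosen for the sketch, not something the abstract specifies.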

Description

Technical field

[0001] The invention belongs to the technical field of webpage information extraction, and in particular relates to a data collection method and system.

Background

[0002] Web page information extraction collects large amounts of data from the Internet in a defined way. These data are important material for research and analysis, machine learning, data mining and other work. Many solutions to this problem have been proposed, but most remain theoretical. At present, web page information extraction techniques can be divided into methods based on web page structure and machine learning methods that use probability models.

[0003] 1. Methods that learn a probability model:

[0004] First, a certain number of web page samples are collected; after the sample type is selected, features are extracted based on experience and existing knowledge. The required answers are then provided to the classifier through manual labeling. A...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30, H04L29/06

Inventors: 施洋, 张奇, 黄萱菁

Owner: FUDAN UNIV
