Data collecting method and system based on HTML stream processing

A data collection method and system based on HTML stream processing, applicable to transmission systems, electrical digital data processing, and special data processing applications. It addresses the inability of existing approaches to bound the download scope or to store the information users need in a structured format, as well as their low efficiency. The claimed benefits are low operating cost, easy maintenance, and a wide range and variety of collectable data.

Inactive Publication Date: 2010-10-13
FUDAN UNIV
Cites: 3, Cited by: 6

AI Technical Summary

Problems solved by technology

[0006] Methods of this type are usually simple in principle and fall roughly into two types. The first type is traversal download: the method exhaustively enumerates the links in a page and then follows those links to download further data. Its main disadvantages are that the download scope cannot be bounded and that the information the user needs cannot be stored in a structured format.
The second type exploits the structure of the website itself. It solves the problems of the first type, but it applies mainly to downloading from specified websites: the program must simulate every website to be downloaded, which makes the approach inefficient.
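
The first type (traversal download) can be illustrated with a minimal sketch. It is not the patent's implementation: the in-memory `PAGES` dictionary is a hypothetical stand-in for real HTTP downloads, and the paths in it are invented sample data. The point is that following every link pulls in pages outside the section the user cares about.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def traverse(pages, start):
    """Breadth-first traversal that follows every link it finds.

    `pages` maps a URL to its HTML (assumption: replaces network access).
    """
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages.get(url, ""))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site: only /v pages are wanted, but /ads pages get crawled too.
PAGES = {
    "/v": '<a href="/v/1">one</a><a href="/ads">ads</a>',
    "/v/1": '<a href="/v">back</a>',
    "/ads": '<a href="/ads/more">more</a>',
    "/ads/more": "",
}

visited = traverse(PAGES, "/v")
```

Nothing in the traversal itself limits it to the `/v` section, which is the unbounded-scope problem described above.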

Examples

Embodiment Construction

[0039] Take "http://www.youku.com/v" as an example source entry for the pages to be collected. First, configure the root entry of the download control template (see Figure 2), which is set in a fixed-format XML document.
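
Figure 2 is not reproduced here, so the actual template format is unknown. As a rough sketch only, a root entry in an XML document might be read as follows; the element name `root` and the attributes `url` and `depth` are assumptions made for illustration, not the patent's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical root-entry template; the real fixed format is defined in Figure 2.
ROOT_TEMPLATE = """
<template>
  <root url="http://www.youku.com/v" depth="2" />
</template>
"""


def load_root_entries(xml_text):
    """Return (url, depth) pairs for every <root> node in the template."""
    doc = ET.fromstring(xml_text.strip())
    return [(node.get("url"), int(node.get("depth", "1")))
            for node in doc.iter("root")]


entries = load_root_entries(ROOT_TEMPLATE)
```

Keeping the entry point in a configuration file rather than in code means a new source can be added without changing the collector.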

[0040] The user configures the download control template according to the data that needs to be stored. The template tells the system which data to store. For example, if the user needs the title, page link, description, and publication time of each video, the stored-data part of the download control template is configured accordingly (see Figure 3, which shows part of the configured template and is an instance of the nodes introduced in Table 1).
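
Since Figure 3 and Table 1 are not available in this text, the following is only a sketch of what a stored-data template for the four fields could look like. The `field` element and its `name`, `tag`, `class`, and `attr` attributes are hypothetical, as are the class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored-data template covering the four fields named in [0040].
STORE_TEMPLATE = """
<store>
  <field name="title" tag="h2" class="video-title" />
  <field name="link" tag="a" class="video-link" attr="href" />
  <field name="description" tag="p" class="video-desc" />
  <field name="published" tag="span" class="video-time" />
</store>
"""


def load_fields(xml_text):
    """Turn each <field> node into a plain dict describing one extraction rule."""
    fields = []
    for node in ET.fromstring(xml_text.strip()).iter("field"):
        fields.append({
            "name": node.get("name"),
            "tag": node.get("tag"),
            "class": node.get("class"),
            "attr": node.get("attr"),  # None means "store the element's text"
        })
    return fields


fields = load_fields(STORE_TEMPLATE)
```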

[0041] The parser reads the HTML stream together with the stored-data content of the download control template in Figure 3, extracts the data the user requires, and stores it in the data storage system. Figure 4 shows the result stored in the database.
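
One way to picture template-driven stream parsing is the sketch below, which uses Python's standard `html.parser` and feeds the HTML in chunks as it would arrive from the network. The rule format, class names, and sample HTML are assumptions for illustration and do not come from the patent.

```python
from html.parser import HTMLParser

# Hypothetical extraction rules, standing in for the stored-data template.
RULES = [
    {"name": "title", "tag": "h2", "class": "video-title", "attr": None},
    {"name": "link", "tag": "a", "class": "video-link", "attr": "href"},
    {"name": "published", "tag": "span", "class": "video-time", "attr": None},
]


class TemplateParser(HTMLParser):
    """Extract the fields named by the rules while the HTML streams in."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.record = {}
        self._active = None  # rule whose element text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for rule in self.rules:
            if rule["tag"] == tag and rule["class"] in classes:
                if rule["attr"]:
                    # Attribute rules are resolved immediately.
                    self.record[rule["name"]] = attrs.get(rule["attr"])
                else:
                    # Text rules collect data until the matching end tag.
                    self._active, self._text = rule, []
                return

    def handle_data(self, data):
        if self._active:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._active and tag == self._active["tag"]:
            self.record[self._active["name"]] = "".join(self._text).strip()
            self._active = None


def extract(chunks, rules):
    """Feed HTML chunk by chunk and return the extracted record."""
    parser = TemplateParser(rules)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.record


# Sample stream, deliberately split in the middle of a tag and of a text node.
CHUNKS = [
    '<h2 class="video-ti',
    'tle">Demo vi',
    'deo</h2><a class="video-link" href="/v/1">play</a>',
    '<span class="video-time">2010-01-01</span>',
]

record = extract(CHUNKS, RULES)
```

Because the parser keeps incomplete tags buffered between `feed` calls, the page never has to be held in memory as a whole document tree. This sketch does not handle nested elements with the same tag name.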

[0042] The...

Abstract

The invention belongs to the technical field of webpage information extraction, and particularly discloses a data collecting method and a data collecting system based on HTML stream processing. The data collecting system consists of a multi-threaded collector, which ensures working speed, a download control template, which ensures working accuracy, and a data storage system. The system can collect the network data a user needs through simple template configuration. Practical application shows that the invention offers good stability, practicability, and efficiency.
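
The collector-plus-storage arrangement can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented system: `fake_fetch` is a hypothetical stand-in for downloading and parsing a page, and an in-memory SQLite table stands in for the data storage system.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for download plus parsing; returns one extracted record."""
    return {"title": "video " + url.rsplit("/", 1)[-1], "link": url}


def collect(urls, fetch, workers=4):
    """Fetch pages on worker threads and store the records in SQLite.

    Only the calling thread touches the database connection, so the
    worker threads never share it.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE videos (title TEXT, link TEXT)")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for rec in pool.map(fetch, urls):
            db.execute("INSERT INTO videos VALUES (?, ?)",
                       (rec["title"], rec["link"]))
    db.commit()
    return db


db = collect(["/v/1", "/v/2", "/v/3"], fake_fetch)
rows = db.execute("SELECT title, link FROM videos ORDER BY link").fetchall()
```

Downloads are I/O-bound, so running them on several threads is a common way to raise throughput while keeping storage writes serialized.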

Description

Technical field

[0001] The invention belongs to the technical field of webpage information extraction, and in particular relates to a data collection method and system.

Background

[0002] Web page information extraction collects large amounts of data from the Internet in a defined way. These data are important material for research and analysis, machine learning, data mining, and other work. Many solutions to this problem have been proposed, but most remain theoretical. At present, web page information extraction techniques can be divided into methods based on web page structure and machine learning methods using probability models.

[0003] 1. Methods that learn a probability model:

[0004] First, a certain number of web page samples are collected. After the sample type is selected, features are extracted based on experience and existing knowledge. The required answers are then provided to the classifier through manual labeling. A...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, H04L29/06
Inventor: 施洋, 张奇, 黄萱菁
Owner: FUDAN UNIV