Data collecting method and system based on HTML stream processing

A data acquisition and HTML document technology, applied in transmission systems, electrical digital data processing, special data processing applications, etc. It addresses problems such as the inability to determine the download range and storage format of the information users need, and low efficiency. Its effects are low operating cost, easy maintenance, and collection over a wide range and variety of data types.
CN101859321A · Inactive · Publication Date: 2010-10-13 · FUDAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: FUDAN UNIV
Publication Date: 2010-10-13
Estimated Expiration: Not applicable · inactive patent


Abstract

The invention belongs to the technical field of web page information extraction and discloses a data collection method and system based on HTML stream processing. The data collection system consists of a multi-threaded collector, which ensures working speed; a download control template, which ensures working accuracy; and a data storage system. The system can collect the network data a user needs through simple template configuration. Practical application shows that the invention offers excellent stability, high practicability, and high efficiency.
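
The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the template fields (`url_pattern`, `capture_tag`), the class and function names, the SQLite table, and the `demo_fetch` stand-in for a real downloader are all assumptions made for illustration. The sketch only shows the general shape of a thread-pool collector that feeds HTML to a parser chunk by chunk, keeps what a template asks for, and stores the result.

```python
import re
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical download-control template: which URLs are in range and
# which element to keep. Field names are illustrative, not from the patent.
TEMPLATE = {
    "url_pattern": re.compile(r"^https://example\.com/news/\d+$"),
    "capture_tag": "h1",
}


class StreamExtractor(HTMLParser):
    """Collects the text of each element named by the template, chunk by chunk."""

    def __init__(self, capture_tag):
        super().__init__()
        self.capture_tag = capture_tag
        self.depth = 0      # greater than 0 while inside a captured element
        self.buffer = []    # text pieces of the element being read
        self.results = []   # completed element texts

    def handle_starttag(self, tag, attrs):
        if tag == self.capture_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.capture_tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())
                self.buffer.clear()

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def collect(url, fetch_chunks, template):
    """Parse one page as a stream; return (url, text) rows, or [] if out of range."""
    if not template["url_pattern"].match(url):
        return []
    parser = StreamExtractor(template["capture_tag"])
    for chunk in fetch_chunks(url):
        parser.feed(chunk)  # feed() accepts partial documents
    parser.close()
    return [(url, text) for text in parser.results]


def run(urls, fetch_chunks, template, workers=4):
    """Collect pages on a thread pool and store the rows in an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (url TEXT, content TEXT)")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results are consumed in the calling thread, so only it touches SQLite.
        for rows in pool.map(lambda u: collect(u, fetch_chunks, template), urls):
            db.executemany("INSERT INTO items VALUES (?, ?)", rows)
    db.commit()
    return db


def demo_fetch(url):
    """Stand-in for a network download: yields a page in two chunks."""
    yield "<html><body><h1>Head"
    yield "line for " + url.rsplit("/", 1)[-1] + "</h1><p>body</p></body></html>"
```

Passing the downloader in as `fetch_chunks` keeps the sketch runnable offline; a real collector would replace `demo_fetch` with a function that yields decoded chunks from an HTTP response.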

Description

Technical field

[0001] The invention belongs to the technical field of webpage information extraction, and in particular relates to a data collection method and system.

Background technique

[0002] Web page information extraction collects large amounts of data from the Internet by some specified means. These data are important material for research and analysis, machine learning, data mining, and other work. Many solutions to this problem have been proposed, but most remain theoretical. At present, web page information extraction techniques can be divided into methods based on web page structure and machine learning methods that use probability models.

[0003] 1. Methods based on probability model learning:

[0004] First, a certain number of web page samples are collected. Once the sample type has been selected, features are extracted based on experience and existing knowledge. The required answers are then supplied to the classifier through manual labeling. A...
