Data collecting method and system based on HTML stream processing
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: FUDAN UNIV
- Publication Date: 2010-10-13
- Estimated Expiration: Not applicable · inactive patent
Abstract
Description
Technical field
[0001] The invention belongs to the technical field of web page information extraction, and in particular relates to a data collection method and system.
Background art
[0002] Web page information extraction collects large amounts of data from the Internet in a systematic way. Such data are important material for research and analysis, machine learning, data mining, and similar work. Many solutions to this problem have been proposed, but most remain theoretical. Current web page information extraction techniques fall into two groups: methods based on web page structure, and machine learning methods that use probability models.
[0003] 1. Methods based on probability model learning:
[0004] First, a certain number of web page samples are collected. After the sample type is selected, features are extracted based on experience and existing knowledge. The required answers are then provided to the classifier through manual labeling. A...