Fine-grained webpage information acquisition method

A webpage information collection method and technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It can solve problems such as narrow application, low efficiency, and limited data volume, and it achieves high efficiency with fewer manual operations.

Inactive Publication Date: 2006-10-11
NANJING UNIV OF TECH
0 Cites, 16 Cited by

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to solve the problems of current webpage information collection methods, which rely on dictionary-type and format-differentiated collection: the amount of data that can be subdivided and collected is very limited, the efficiency ...


Embodiment Construction

[0024] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0025] As shown in Figures 1, 2, and 3.

[0026] A fine-grained webpage information collection method that imitates manual collection, comprising the following steps:

[0027] a. Use the roaming method of traditional network robots to collect well-structured or semi-structured webpage content, together with its URLs, from the Internet;

[0028] b. Distinguish the templates of the collected webpages by their URLs (i.e. webpage addresses): use the part of the URL before the symbol "?" as the identification mark of the template (some webpages use POST or cookies to pass parameters);

[0029] c. Manually collect the information elements required for one or more purposes from the content of the above webpages, and store them in the local database together with the aforementioned identification mark of each webpage; the information elements mentioned here refer to fine-grained collection information, su...
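Steps b and c might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `template_mark` and `store_elements`, the table layout, the example URL, and the element names are all hypothetical, and an in-memory SQLite database stands in for the local database.

```python
import sqlite3


def template_mark(url: str) -> str:
    """Step b: use the part of the URL before the first '?' as the mark."""
    return url.split("?", 1)[0]


def store_elements(conn: sqlite3.Connection, url: str, elements: dict) -> str:
    """Step c: store manually collected elements together with the mark.

    `elements` maps a hypothetical element name to the value a person
    picked out of the page by hand.
    """
    mark = template_mark(url)
    conn.executemany(
        "INSERT INTO elements (mark, name, value) VALUES (?, ?, ?)",
        [(mark, name, value) for name, value in elements.items()],
    )
    return mark


# Assumed schema: one row per collected information element.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elements (mark TEXT, name TEXT, value TEXT)")

# Hypothetical page and hand-collected values, for illustration only.
mark = store_elements(
    conn,
    "http://example.com/product.asp?id=42",
    {"price": "100 yuan", "vendor": "Example company"},
)
rows = conn.execute(
    "SELECT mark, name, value FROM elements ORDER BY name"
).fetchall()
```

Two pages that differ only in their query string (for example `?id=42` and `?id=43`) yield the same mark here, which is what lets them be treated as instances of one template. Pages that pass parameters via POST or cookies carry no "?" in the URL, so this sketch simply returns the whole URL for them.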

Abstract

The invention relates to a fine-grained webpage information collection method that imitates manual collection and solves the problems of existing dictionary-type and format-type webpage information collection, such as low efficiency and high cost. It can be widely used in building search engines, commercial information collection, and information search on e-commerce websites. Compared with the coarse-grained full-text information collected by traditional network robots, the invention increases the value of the collected fine-grained information, which can be used directly for commercial information analysis, databases of similar websites, etc.

Description

Technical field

[0001] The invention relates to a network information collection method that can be widely used to build fine-grained query search engines, in particular to a fine-grained webpage information collection method that imitates manual collection.

Background technique

[0002] At present, related technologies at home and abroad fall into two types: dictionary-type analysis and collection, and format-differentiated collection. The applicable scope of format-differentiated collection is very narrow, and the amount of data that can be subdivided and collected is very limited (such as the title of an article in a webpage, the title in the country page, etc.), and it is rarely used at home and abroad. At present, dictionary analysis is mostly used, with patterns such as: *company; *university; *hospital; *yuan. This method has many disadvantages, such as low efficiency, low accuracy, high construction cost, and narrow application. Using this method to m...
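The dictionary-type analysis mentioned in the background can be illustrated roughly as below. This is only a guess at how such pattern dictionaries are applied: the source lists the patterns (`*company`, `*university`, `*hospital`, `*yuan`) but not the matching procedure, so the function name `dictionary_match`, the case-insensitive matching, and the sample phrases are assumptions.

```python
from fnmatch import fnmatchcase

# Suffix patterns quoted in the background section.
PATTERNS = ["*company", "*university", "*hospital", "*yuan"]


def dictionary_match(phrases):
    """Return (phrase, pattern) pairs for phrases that fit a pattern.

    Each phrase is compared against the dictionary in order and is
    reported with the first pattern it matches.
    """
    hits = []
    for phrase in phrases:
        for pattern in PATTERNS:
            # Assumption: matching ignores letter case.
            if fnmatchcase(phrase.lower(), pattern):
                hits.append((phrase, pattern))
                break
    return hits


# Hypothetical candidate phrases already cut out of a page.
hits = dictionary_match(["Nanjing University", "300 yuan", "hello world"])
```

Every new kind of entity needs its own dictionary entry, and anything that does not end in a listed word is missed, which is consistent with the construction cost and narrow applicability the background attributes to this approach.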

Application Information

IPC(8): G06F17/30
Inventor: 于磊, 潘郁
Owner: NANJING UNIV OF TECH