Subject-oriented web page collection system

A subject-oriented web page collection system, applied in the field of network communication. It addresses problems of existing topic crawlers, such as the difficulty of judging whether a page on a target website belongs to the topic, the difficulty of defining topic-related pages, and the collection of large numbers of off-topic pages, with the aim of higher collection efficiency and stronger page adaptability.

Inactive Publication Date: 2013-09-18
BEIHANG UNIV
Cites: 0; Cited by: 10

AI Technical Summary

Problems solved by technology

[0003] Existing topic web crawlers have the following problems: (1) When collecting topic web page information, it is difficult to judge whether a web page on the target website belongs to the topic, so a large number of non-topic web pages are easily collected.
(2) The advantage of a topic web crawler is that it does not need to traverse all pages and only visits pages related to the topic; during selection, however, it is very difficult to define which pages are related to the topic.

Examples

Embodiment Construction

[0026] As shown in Figure 1, the system of the present invention consists of three modules: a sample training module, a strategy search module, and a collection module. The sample training module analyzes a manually assembled web page sample library to calculate the topic feature vectors and values and the page similarity threshold. The strategy search module controls the set of URL addresses retrieved by the system and confines the search range to the candidate seed websites. The collection module accepts the URL addresses sent by the strategy search module and performs page purification, feature extraction, analysis, and collection and storage.
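
The division of labor between the sample training module and the collection module can be illustrated with a minimal Python sketch. It is not the patented method: the term-frequency vectors, the cosine similarity measure, the rule that sets the threshold to the lowest similarity among the samples, and all function names (train_topic, is_on_topic, and so on) are assumptions chosen only to make the idea concrete.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed feature representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_topic(samples):
    """Sample training step: build a topic vector from hand-picked pages.

    Assumption: the similarity threshold is the lowest similarity that any
    sample page has to the aggregated topic vector.
    """
    topic = Counter()
    for page in samples:
        topic.update(term_vector(page))
    threshold = min(cosine(term_vector(page), topic) for page in samples)
    return topic, threshold


def is_on_topic(page_text, topic, threshold):
    """Collection step: keep a purified page only if it clears the threshold."""
    return cosine(term_vector(page_text), topic) >= threshold


if __name__ == "__main__":
    samples = [
        "web crawler collects topic pages",
        "topic crawler visits web pages",
    ]
    topic, threshold = train_topic(samples)
    print(is_on_topic("a topic crawler collects web pages", topic, threshold))
    print(is_on_topic("weather forecast for the weekend", topic, threshold))
```

In this sketch the threshold is derived automatically from the sample library; a manually entered threshold could be passed to is_on_topic in the same way.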

[0027] The specific functions and interaction process of several main modules are described below.

[0028] 1. Strategy search module

[0029] The functional design of the strategy search module is based on Internet information search, which is a technology based on ...
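
The strategy search module's role of confining the search range to candidate seed websites can be sketched as a URL scope filter. This is a hedged illustration only: the host-matching rule (exact host or subdomain of a seed host), the restriction to http and https, the de-duplication, and the names in_seed_scope and filter_urls are assumptions, not details taken from the patent.

```python
from urllib.parse import urlparse


def in_seed_scope(url, seed_hosts):
    """Return True if the URL points at a seed host or one of its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return any(host == seed or host.endswith("." + seed) for seed in seed_hosts)


def filter_urls(urls, seed_hosts):
    """Keep in-scope URLs in their original order, dropping repeats."""
    seen = set()
    kept = []
    for url in urls:
        if url not in seen and in_seed_scope(url, seed_hosts):
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    seeds = {"example.org"}  # hypothetical candidate seed website
    candidates = [
        "http://example.org/a",
        "http://news.example.org/b",
        "http://other.test/c",
        "http://example.org/a",
    ]
    # Only the first two URLs are passed on to the collection module.
    print(filter_urls(candidates, seeds))
```

Matching on the parsed host name rather than on a substring of the URL avoids accepting look-alike hosts that merely end with the seed name.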

Abstract

The invention relates to a subject-oriented web page collection system, which belongs to the field of network communication and is used for collecting subject-oriented network information. The system comprises a sample training module, a strategy search module, and a collection module. The sample training module analyzes a manually assembled web page sample library to obtain the subject feature vectors and values and the page similarity threshold. The strategy search module controls the retrieved URL (Uniform Resource Locator) address set and confines the search range to candidate seed websites. The collection module receives the URL address set sent by the strategy search module and performs page purification, feature extraction, analysis, collection, and storage. During feature analysis, the system judges whether the subject feature vectors and values and the similarity threshold of a subject web page need to be filled in manually with reference to the result of the sample training module. The system has higher efficiency and stronger page adaptability and effectively solves the problems in the prior art.

Description

Technical Field

[0001] The invention relates to a subject-oriented web page collection system, which belongs to the field of network communication and is used for subject-oriented network information collection.

Background

[0002] With the rapid growth of Web information resources, traditional information search systems can no longer guarantee that information is updated in time, and because the subject range of the collected information is too wide, they cannot meet people's growing demand for personalized information search services. In recent years, researchers have repeatedly proposed development directions for a new generation of search engines, and topic search is one of the most prominent. Compared with ordinary search engines, a topic search engine covers a relatively small search range, so precision and recall are easier to guarantee. In the search process, there is no need to traverse the entire Web, just select pages related to t...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 王宝会, 于雷, 王丽华, 王新河, 尹科
Owner: BEIHANG UNIV