Construction and network resource collection method of self-adaption network resource collection system

A technology for constructing an adaptive network resource collection system and for collecting network resources. It addresses the limited scope of application and poor scalability of existing approaches, and achieves strong versatility and high scalability.

Inactive Publication Date: 2014-07-02
PEKING UNIV
5 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

[0009] The present invention mainly solves the problems of poor scalability and limited scope of application in the prior art. It provides an adaptive network resource collection method that can be applied to target network resources of different data types, has a wide application range, and is highly scalable.



Examples


Embodiment 1

[0041] In this embodiment, the user needs to collect one kind of software project data from the Apache Lucene project: the SVN version repository.

[0042] First, a network resource collection system is constructed and a unified network resource collection module is configured. The network resource collection module includes a unified crawler distribution device and at least one crawler execution unit to be called. The crawler distribution device consists of:

[0043] The initial unit performs preprocessing before information is grabbed, including checking the validity of the SVN data interface, creating a file directory for storing data, writing log information, obtaining an idle sub-thread for the grabbing task, and creating a resource grabbing task record.
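
The preprocessing steps listed in [0043] can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `TaskRecord`, `is_plausible_svn_interface`, and `prepare_task`, the example URL, and the purely syntactic validity check are all assumptions, and obtaining an idle sub-thread is left out.

```python
import logging
import tempfile
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

logger = logging.getLogger("initial_unit")


@dataclass
class TaskRecord:
    """Hypothetical record describing one grabbing task."""
    resource_url: str
    data_type: str
    output_dir: Path
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "pending"


def is_plausible_svn_interface(url: str) -> bool:
    """Syntactic check only (assumption); a real check would contact the server."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https", "svn", "svn+ssh"} and bool(parsed.netloc)


def prepare_task(resource_url: str, data_type: str, base_dir: Path) -> TaskRecord:
    """Validate the interface, create the storage directory, log, and record the task."""
    if not is_plausible_svn_interface(resource_url):
        raise ValueError(f"invalid SVN data interface: {resource_url!r}")
    output_dir = base_dir / data_type
    output_dir.mkdir(parents=True, exist_ok=True)
    logger.info("prepared %s task for %s", data_type, resource_url)
    return TaskRecord(resource_url, data_type, output_dir)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with tempfile.TemporaryDirectory() as tmp:
        # svn.example.org is a placeholder host, not a real data source.
        record = prepare_task("https://svn.example.org/repos/demo", "svn", Path(tmp))
        print(record.status, record.output_dir.name)
```

Keeping the task record as a plain data object makes it easy to persist or hand to a worker thread later, whichever mechanism the actual system uses.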

[0044] The collection unit is used to select different crawler programs to collect the data of the target network resource according to the data type of the target network resource. The specific steps include: finding the ...
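
Selecting a crawler program by data type, as the collection unit does, might look like the Python sketch below. The `Crawler` interface, `SvnCrawler` stub, `CrawlerDispatcher` class, and example URL are hypothetical names introduced for illustration; the stub returns a placeholder string instead of contacting a repository.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Crawler(ABC):
    """Hypothetical unified crawler interface."""

    @abstractmethod
    def crawl(self, resource_url: str) -> List[str]:
        """Return the records collected from resource_url."""


class SvnCrawler(Crawler):
    """Stub crawler; a real one would read the SVN revision history."""

    def crawl(self, resource_url: str) -> List[str]:
        return [f"svn record from {resource_url}"]


class CrawlerDispatcher:
    """Maps a data type to a crawler class and runs the matching crawler."""

    def __init__(self) -> None:
        self._registry: Dict[str, Type[Crawler]] = {}

    def register(self, data_type: str, crawler_cls: Type[Crawler]) -> None:
        self._registry[data_type] = crawler_cls

    def collect(self, data_type: str, resource_url: str) -> List[str]:
        try:
            crawler_cls = self._registry[data_type]
        except KeyError:
            raise LookupError(f"no crawler registered for {data_type!r}") from None
        return crawler_cls().crawl(resource_url)


if __name__ == "__main__":
    dispatcher = CrawlerDispatcher()
    dispatcher.register("svn", SvnCrawler)
    # svn.example.org is a placeholder host, not a real data source.
    print(dispatcher.collect("svn", "https://svn.example.org/repos/demo"))
```

Because the dispatcher only knows the interface, supporting a new data type means registering one more class rather than changing the collection logic.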

Embodiment 2

[0055] In this embodiment, the user needs to grab another type of software project data of the Apache Lucene project: the user mailing list.

[0056] This embodiment differs from Embodiment 1 only in the configuration of the dependency module; the remaining steps are the same. When configuring the dependency module, first write a crawler execution unit for the mailing list according to the unified crawler interface, then add the dependency for the user mailing list resource to the crawler dependency module, as follows:

[0057]

[0058] Wherein, ApacheMailListCrawler is a crawler program capable of grabbing mail information from a mailing list management page under the Apache website.
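
The configuration listing of [0057] is not reproduced on this page, and the Python sketch below does not attempt to reconstruct it. It only illustrates the general idea of registering a dependency between a resource type and a crawler by name. `CrawlerDependencyModule`, the `"user_mailing_list"` key, and the example URL are assumptions; `ApacheMailListCrawler` reuses the name from the text but is a stub that fetches nothing.

```python
from typing import Dict, List


class ApacheMailListCrawler:
    """Stub standing in for the crawler named in the text; it fetches nothing."""

    def crawl(self, resource_url: str) -> List[str]:
        return [f"mail record from {resource_url}"]


class CrawlerDependencyModule:
    """Hypothetical dependency table: data type -> crawler class name."""

    def __init__(self, available: Dict[str, type]) -> None:
        self._available = dict(available)       # crawler name -> class
        self._dependencies: Dict[str, str] = {}  # data type -> crawler name

    def add_dependency(self, data_type: str, crawler_name: str) -> None:
        if crawler_name not in self._available:
            raise ValueError(f"unknown crawler: {crawler_name!r}")
        self._dependencies[data_type] = crawler_name

    def resolve(self, data_type: str):
        """Instantiate the crawler that the given data type depends on."""
        crawler_name = self._dependencies[data_type]
        return self._available[crawler_name]()


if __name__ == "__main__":
    module = CrawlerDependencyModule({"ApacheMailListCrawler": ApacheMailListCrawler})
    module.add_dependency("user_mailing_list", "ApacheMailListCrawler")
    crawler = module.resolve("user_mailing_list")
    # lists.example.org is a placeholder host, not a real data source.
    print(crawler.crawl("https://lists.example.org/demo-user"))
```

Storing the dependency as a name rather than a hard-coded class reference is one way to let the mapping live in configuration, which matches the decoupling goal the patent describes.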

[0059] The above embodiments are the general process of capturing certain software-related data by the method and the system constructed in the present invention, which can be applied to other data sets that have a clear data interface a...


Abstract

The invention relates to the construction of an adaptive network resource collection system and a network resource collection method, in particular for network resources related to general open source software projects. The system includes a network resource collection module and a crawler dependency module. The crawler dependency module establishes the dependency relation between the network resource collection module and the target network resources, and according to that relation the corresponding crawler execution unit is configured for the network resource collection module to carry out resource collection. The method therefore adapts to target network resources of different data types and is highly versatile; it also decouples a specific data source from its dependence on a single-purpose crawler program, which gives high extensibility.

Description

Technical field

[0001] The invention relates to the construction of an adaptive network resource collection system and a network resource collection method, in particular to the construction of a general open source software project-related network resource collection system and a network resource collection method.

Background art

[0002] Data related to open source software projects is one of the main data sources for computer software research. There are two main technologies related to data collection of existing open source software projects:

[0003] One is to obtain open source software project data by writing a single-purpose data scraping program. Researchers first determine the data source of the required data on the Internet, and determine the storage structure and interface of the data in the data source, and then write a web crawler program to grab the data according to the data interface provided by the data source.

[0004] The second is to use general...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 邹艳珍 (Zou Yanzhen), 张灵箫 (Zhang Lingxiao)
Owner: PEKING UNIV