A multi-data source oriented network data collection and presentation method

A network data collection technology for multiple data sources, applied in the fields of network data retrieval, network data indexing, and other database retrieval. It achieves the effect of avoiding stalls (lag).

Active Publication Date: 2019-03-29
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

General-purpose crawlers collect every document that can be parsed, relying mainly on URL filtering. Their drawback is that the crawl results are the same for everyone, so they cannot provide different search results for people with different backgrounds.
Incremental crawlers crawl only newly added or changed pages to keep local copies up to date. Their drawback is that changed pages must be re-crawled many times, at different frequencies, within a short period; websites with anti-crawling mechanisms make data crawling harder and reduce crawling efficiency.
Focused crawlers filter pages by comparing the page content with the topic being searched, and use a page only when its relevance reaches a certain ratio. The problem is that a page may contain multiple topics: irrelevant topics mask the highly relevant ones, so the topic relevance computed for the whole page is inaccurate.
Deep crawlers have the problem that when web pages are nested too deeply, crawling can take too long or even fail to return ("can't come back").

Examples

Embodiment Construction

[0021] The technical solution of the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0022] 1. Web crawler algorithm design

[0023] In essence, a crawler is an Internet information collection tool. By system structure and implementation technology, web crawlers can be divided into the following types: General Purpose Web Crawler, Focused Web Crawler, Incremental Web Crawler, and Deep Web Crawler. The website characteristics of different media platforms differ, and web page structures are complex and varied, so the crawler cannot be of a single type. Therefore, the present invention combines the general-purpose crawler and the deep web crawler to implement data collection. Using the breadth-first traversal algorithm, the web crawler shown in Figure 1 is designed.
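
The breadth-first traversal named in [0023] can be illustrated with a minimal sketch. This is not the crawler of Figure 1, whose details are not reproduced here; it only shows the general technique. The function names `bfs_crawl` and `toy_fetch_links`, the `max_depth` cap, and the toy link graph are all assumptions introduced for illustration, and link fetching is passed in as a callable so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> List[str]:
    """Visit pages breadth-first and return the URLs in visiting order."""
    visited = set()
    order = []
    # Each queue entry is (url, depth); seeds start at depth 0.
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue  # skip pages that were already collected
        visited.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # assumed depth cap: do not expand deeper pages
        for link in fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return order


# Hypothetical in-memory link graph standing in for real web pages.
TOY_SITE = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": ["e"],
    "e": [],
}


def toy_fetch_links(url: str) -> List[str]:
    """Return the outgoing links of a page in the toy graph."""
    return TOY_SITE.get(url, [])


if __name__ == "__main__":
    print(bfs_crawl(["a"], toy_fetch_links, max_depth=2))
```

The depth cap is one common way to bound crawling time on deeply nested pages; whether the patented method uses such a cap is not stated in the text above.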

[0024] In the specific implementation of the algorithm, two ...

Abstract

The invention discloses a multi-data source oriented network data collection and presentation method. Based on a study of the data collection strategies of six media platforms (Sina Weibo, People's Daily, Baidu Encyclopedia, Baidu Tieba, WeChat Public Homepage and Dongfang Wealth Stock Bar), the method uses Servlet background scheduling technology to integrate web crawlers oriented to multiple data sources, solving the data collection problem across different media platforms. In the implementation, first, manual operations such as login are simulated with the Web application testing toolkit Selenium. Second, XPath element queries are used to parse the web page source code, and the extracted data is stored in a database. Finally, the crawled data is read from the database and displayed on the front-end page. Experiments show that the crawler achieves the maximum collection efficiency on the premise of ensuring data integrity.
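
The extract, store, and read-back steps described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the page markup, the `posts` table, and the function names are hypothetical; Python's standard-library `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) stands in for a full XPath engine; and an in-memory SQLite database stands in for the unnamed database. In the described method the page source would come from Selenium after simulated login, which is omitted here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page source standing in for a fetched page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="post"><span class="author">alice</span><p class="text">first post</p></div>
    <div class="post"><span class="author">bob</span><p class="text">second post</p></div>
    <div class="ad"><p class="text">ignore me</p></div>
  </body>
</html>
"""


def extract_posts(page_source):
    """Select post blocks with an XPath-style query and pull out their fields."""
    root = ET.fromstring(page_source.strip())
    posts = []
    for node in root.findall(".//div[@class='post']"):
        author = node.findtext("span[@class='author']", default="")
        text = node.findtext("p[@class='text']", default="")
        posts.append((author.strip(), text.strip()))
    return posts


def store_posts(conn, posts):
    """Persist extracted rows in an assumed two-column table."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (author TEXT, content TEXT)")
    conn.executemany("INSERT INTO posts (author, content) VALUES (?, ?)", posts)
    conn.commit()


def load_posts(conn):
    """Read the stored rows back in insertion order, as a display layer would."""
    return conn.execute(
        "SELECT author, content FROM posts ORDER BY rowid"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE_PAGE))
    for author, content in load_posts(conn):
        print(author, content)
```

Real-world HTML is rarely well-formed XML, so a production crawler would typically rely on an HTML-aware parser or the browser's own XPath lookup instead of `ElementTree`.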

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and relates to a multi-data source oriented network data collection and presentation method.

Background technology

[0002] At present, network data collection is mainly carried out with web spiders (or data collection robots) for vertical fields, combined with page analysis and other related technologies. At this stage, many enterprises in China engage in "mass data collection"; most use vertical crawler technology, and some combine a variety of related technologies on this basis. For example, "Train Collector" combines vertical crawling, network radar, information tracking and automatic sorting, and automatic indexing for massive data collection and post-processing; the "Octopus Collector" of Shenzhen Vision Information Technology Co., Ltd. is based on a completely independently developed distributed cloud computing platform. ...

Application Information

IPC(8): G06F16/951, G06F16/955
Inventor: 张仰森, 曾健荣, 陈若愚, 黄改娟, 王胜
Owner: BEIJING INFORMATION SCI & TECH UNIV