Method for constructing a web crawler based on news deduplication

A web crawler construction method, applied in special data processing applications, instruments, and electrical digital data processing, that addresses problems such as low algorithm efficiency, heavy resource waste, and difficult data maintenance, and achieves convenient data maintenance, little resource waste, and savings in storage resources.

Active Publication Date: 2010-04-14
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

[0011] To overcome the disadvantages of the prior art (low algorithm efficiency, a tendency to fetch web pages with repeated content, heavy resource waste, and difficult data maintenance), the present invention provides a method for constructing a web crawler based on deduplication of news that is efficient, avoids fetching web pages with duplicate content, wastes few resources, and makes data maintenance convenient.

Examples

Embodiment 1

[0041] Refer to attached Figures 1, 2, and 4.

[0042] A method for constructing a web crawler based on deduplication of news, comprising the following steps:

[0044] 1) Construct a parser that can extract the title and content of the news in a webpage, and parse news webpages with this parser;

[0045] 2) Build a collection of news webpages to form a news set; set a threshold for the similarity between the currently fetched webpage and the news webpages in the news set, where similarity is characterized by the degree of repetition of the content;

[0046] 3) Compare the currently fetched news webpage with the news set, and judge whether the similarity between them is higher than the threshold;

[0047] (3.1) Extract the keywords and the weight of each keyword from the text of the news title using Chine...
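
Steps 2) and 3) can be illustrated with a minimal sketch. Because sub-steps (3.2) to (3.4) are not reproduced in this text, the similarity measure below (cosine similarity over keyword weights) is an assumed stand-in, not the measure the patent defines. The whitespace tokenizer with relative-frequency weights likewise only stands in for Chinese word segmentation, and the names `keyword_weights`, `cosine_similarity`, and `is_duplicate` are hypothetical.

```python
import math
from collections import Counter


def keyword_weights(text):
    """Assumed stand-in for segmentation-based keyword extraction:
    split on whitespace and weight each token by its relative frequency."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two keyword-weight dictionaries
    (an assumed measure; the source's own formula is elided)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(title, news_titles, threshold):
    """Return True if the title is more similar than the threshold
    to any title already in the news set."""
    current = keyword_weights(title)
    return any(
        cosine_similarity(current, keyword_weights(old)) > threshold
        for old in news_titles
    )
```

In this sketch the threshold is a value between 0 and 1; a lower threshold discards more near-duplicates at the cost of occasionally rejecting distinct news items.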

Embodiment 2

[0063] Refer to attached Figures 1, 3, and 4.

[0064] This embodiment differs from Embodiment 1 as follows: if the set C is judged through (3.4) to be non-repeated news, the keywords and the weight of each keyword are then extracted from the news body text using Chinese word segmentation, and (3.2) to (3.4) are performed again in sequence; if the news is still judged to be non-repeated, it is added to the news set. The rest is the same.
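
The two-stage check of this embodiment (title first, then body) might be sketched as follows. The Jaccard word-overlap function is an assumed placeholder for the elided steps (3.2) to (3.4), and the names `jaccard` and `add_if_new` as well as the dictionary keys `title` and `body` are hypothetical.

```python
def jaccard(a, b):
    """Assumed placeholder similarity: overlap of whitespace-separated
    word sets. The source's own measure is not reproduced here."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def add_if_new(news, news_set, threshold, similarity=jaccard):
    """Add `news` to `news_set` only if it passes both checks.

    Stage 1 compares titles; only when the title looks non-repeated
    does stage 2 compare the body text. Returns True if the news was added.
    """
    # Stage 1: title comparison, as in Embodiment 1.
    if any(similarity(news["title"], old["title"]) > threshold for old in news_set):
        return False
    # Stage 2: body comparison, run only for news that passed stage 1.
    if any(similarity(news["body"], old["body"]) > threshold for old in news_set):
        return False
    news_set.append(news)
    return True
```

The second stage catches news items republished under a different headline, at the cost of one extra comparison pass for every item whose title is new.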

Abstract

The invention relates to a method for constructing a web crawler based on deduplication of news, comprising the following steps: constructing a parser for parsing news webpages; constructing a news set; setting a threshold for the similarity between webpages; comparing the currently fetched news webpage with the news set and judging whether the similarity is higher than the threshold; if the similarity is lower than the threshold, adding the current webpage to the news set; if the similarity is higher than the threshold, discarding the news and fetching the next webpage; extracting a URL from the current webpage and judging whether it points to a news webpage; if it does not, discarding the URL; if it does, judging whether the URL has already been accessed; if it has been accessed, discarding the URL; if it has not, storing the URL in a to-be-accessed queue; extracting URLs from the to-be-accessed queue in sequence and accessing them; and repeating the above steps. The invention has the advantages of high algorithm efficiency, little resource waste, and convenient data maintenance, and avoids fetching webpages with repeated content.
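
The crawl loop described in the abstract could look roughly like the sketch below. Fetching, link extraction, the news-URL test, and the duplicate test are passed in as functions, since the source does not specify them; all names (`crawl`, `fetch`, `extract_links`, `is_news_url`, `is_duplicate`, `max_pages`) are hypothetical. The sketch also assumes that links are not followed from a page discarded as a duplicate, which is one possible reading of "discarding the news and fetching the next webpage".

```python
from collections import deque


def crawl(seed_urls, fetch, extract_links, is_news_url, is_duplicate, max_pages=100):
    """Breadth-first crawl that keeps only non-duplicate news pages.

    fetch(url) returns a page object or None; extract_links(page) returns
    URLs; is_news_url(url) and is_duplicate(page, news_set) return booleans.
    """
    to_visit = deque(seed_urls)   # the to-be-accessed queue
    seen = set(seed_urls)         # URLs already queued or accessed
    news_set = []
    fetched = 0
    while to_visit and fetched < max_pages:
        url = to_visit.popleft()
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        if is_duplicate(page, news_set):
            # Similarity above the threshold: discard and move on.
            continue
        news_set.append(page)
        for link in extract_links(page):
            # Keep only unseen URLs that point to news pages.
            if is_news_url(link) and link not in seen:
                seen.add(link)
                to_visit.append(link)
    return news_set
```

Marking a URL as seen when it is queued, rather than when it is fetched, keeps the same URL from entering the queue twice. The `max_pages` cap is an addition for this sketch so that the loop terminates on large sites.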

Description

Technical field

[0001] The invention relates to a method for constructing a web crawler, in particular to a method for constructing a web crawler based on deduplication of news.

Background technique

[0002] In this era of information explosion, Internet media, with its rapid news release and wide news dissemination, has gradually replaced traditional media such as TV and newspapers and has become the mainstream channel for news dissemination.

[0003] The current major news portals, "Sina.com", "Xinhuanet", and "NetEase", all have their own strong teams for news gathering, editing, and publishing, and each releases thousands of news items every day. News websites generally cover various categories of news: domestic news, international news, social news, entertainment news, military news, sports news, financial news, technology news, etc. At the same time, each news portal also has its own characteristics, such as "Xinhuanet" for current affairs news, "Sina.com" f...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 卜佳俊, 李辉, 陈伟, 陈纯, 梁雄君
Owner: ZHEJIANG UNIV