Method for constructing webpage crawler based on repeated removal of news

A web crawler construction method, applied in special data processing applications, instruments, and electrical digital data processing, that addresses problems such as low algorithm efficiency, heavy waste of resources, and difficult data maintenance, and achieves convenient data maintenance, little waste of resources, and savings in storage resources.

Active Publication Date: 2010-04-14

ZHEJIANG UNIV

Cites: 0; Cited by: 48

Problems solved by technology

[0011] To overcome the disadvantages of the prior art, namely low algorithm efficiency, a tendency to crawl webpages with duplicate content, heavy waste of resources, and difficult data maintenance, the present invention provides a method for constructing a web crawler based on deduplication of news that offers high algorithm efficiency, avoids crawling webpages with duplicate content, wastes few resources, and makes data maintenance convenient.

Examples

Embodiment 1

[0041] Refer to attached Figures 1, 2, and 4.

[0042] A method for constructing a web crawler based on deduplication of news, comprising the following steps:

[0044] 1) Construct a parser that can extract the title and body of the news in a webpage, and parse news webpages with this parser;

[0045] 2) Build a collection of news webpages to form a news set; set a threshold for the similarity between the currently crawled webpage and the news webpages in the news set, where similarity is characterized by the degree of content repetition;

[0046] 3) Compare the currently crawled news webpage with the news set, and judge whether the similarity between them is higher than the threshold;

[0047] (3.1) Extract the keywords and the weight of each keyword from the text of the news title using Chine...
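The threshold comparison in steps 2) and 3) can be illustrated with a minimal Python sketch. The excerpt above is truncated before it specifies how similarity is computed, so everything concrete here is an assumption made for illustration: whitespace splitting stands in for Chinese word segmentation, normalized term frequency stands in for keyword weights, cosine similarity stands in for the similarity measure, and the names `keyword_weights`, `cosine_similarity`, `is_duplicate`, `add_if_new`, and the default threshold of 0.8 are all hypothetical.

```python
import math
from collections import Counter


def keyword_weights(text):
    """Assumed stand-in for Chinese word segmentation: split on whitespace
    and use normalized term frequency as each keyword's weight."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine similarity between two keyword-weight dictionaries
    (an assumed measure; the source does not name one here)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(title, news_set, threshold=0.8):
    """True if the title is more similar than the threshold to any stored title."""
    candidate = keyword_weights(title)
    return any(
        cosine_similarity(candidate, keyword_weights(stored)) > threshold
        for stored in news_set
    )


def add_if_new(title, news_set, threshold=0.8):
    """Add the title to the news set unless it is judged a duplicate."""
    if is_duplicate(title, news_set, threshold):
        return False
    news_set.append(title)
    return True
```

Calling `add_if_new` twice with the same title adds it only once, because the second call finds an identical keyword-weight vector in the news set and the similarity exceeds the threshold.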

Embodiment 2

[0063] Refer to attached Figures 1, 3, and 4.

[0064] This embodiment differs from Embodiment 1 as follows: if step (3.4) judges set C to be non-duplicate news, the keywords and the weight of each keyword are then extracted from the news body text using Chinese word segmentation, and steps (3.2) to (3.4) are performed again in sequence; if the news is still judged to be non-duplicate, it is added to the news set. The rest is the same as in Embodiment 1.
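The two-pass check described in this embodiment (title first, then body) might be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: steps (3.2) to (3.4) are not visible in this excerpt, so a cosine comparison of term-frequency weights is substituted for them, whitespace splitting replaces Chinese word segmentation, and `weights`, `similarity`, `process_news`, and the 0.8 threshold are hypothetical names and values.

```python
import math
from collections import Counter


def weights(text):
    """Assumed keyword weighting: normalized term frequency over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def similarity(a, b):
    """Assumed similarity measure: cosine of two keyword-weight dictionaries."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def process_news(title, body, news_set, threshold=0.8):
    """Return True if the news item was added to news_set.

    Pass 1 compares title keywords; only if the title looks new does
    pass 2 compare body keywords. The item is stored when both passes
    judge it non-duplicate.
    """
    title_w = weights(title)
    if any(similarity(title_w, weights(item["title"])) > threshold for item in news_set):
        return False  # judged duplicate on the title alone
    body_w = weights(body)
    if any(similarity(body_w, weights(item["body"])) > threshold for item in news_set):
        return False  # new title, but the body repeats stored news
    news_set.append({"title": title, "body": body})
    return True
```

Checking the short title before the longer body keeps the common case cheap: the body is segmented and compared only when the title comparison has not already settled the question.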

Abstract

The invention relates to a method for constructing a web crawler based on deduplication of news, comprising the following steps: constructing a parser for parsing news webpages; constructing a news set; setting a threshold for the similarity between webpages; comparing the currently crawled news webpage with the news set and judging whether the similarity is higher than the threshold; if the similarity is lower than the threshold, adding the current webpage to the news set; if the similarity is higher than the threshold, discarding the news and crawling the next webpage; extracting a URL from the current webpage and judging whether it points to a news webpage; if it does not, discarding the URL; if it does, judging whether the URL has already been accessed; if it has been accessed, discarding the URL; if it has not, storing it in a queue of URLs to be accessed; extracting URLs from that queue in sequence and accessing them; and repeating these steps. The invention offers high algorithm efficiency, little waste of resources, and convenient data maintenance, and avoids crawling webpages with duplicate content.
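The crawl loop summarized in the abstract can be sketched in Python as below. It is a minimal illustration, not the patented implementation: the in-memory `WEB` dictionary replaces real HTTP fetching, the `/news/` substring test is an assumed way of deciding whether a URL points to a news webpage, exact text matching stands in for the thresholded similarity check, and all function and variable names are hypothetical. The sketch also assumes that links on a discarded duplicate page are not followed, which is one reading of "discarding the news and crawling the next webpage".

```python
from collections import deque


def crawl(seed, fetch, is_news_url, is_duplicate):
    """Breadth-first crawl that stores only non-duplicate news text."""
    news_set = []
    visited = {seed}          # URLs already queued or accessed
    queue = deque([seed])     # queue of URLs to be accessed
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if is_duplicate(text, news_set):
            continue          # discard duplicate news, move to the next page
        news_set.append(text)
        for link in links:
            if not is_news_url(link):
                continue      # discard URLs that do not point to news pages
            if link in visited:
                continue      # discard URLs that were already accessed
            visited.add(link)
            queue.append(link)
    return news_set


# Hypothetical in-memory site used instead of network access.
WEB = {
    "site/news/1": ("flood hits city", ["site/news/2", "site/about", "site/news/3"]),
    "site/news/2": ("flood hits city", ["site/news/4"]),
    "site/news/3": ("team wins final", ["site/news/1"]),
}

result = crawl(
    "site/news/1",
    fetch=WEB.get,
    is_news_url=lambda u: "/news/" in u,
    is_duplicate=lambda text, seen: text in seen,
)
```

In this toy run, `site/news/2` repeats the text of the seed page and is discarded, `site/about` is filtered out as a non-news URL, and the link back to `site/news/1` is skipped because it was already accessed.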

Description

Technical field

[0001] The invention relates to a method for constructing a web crawler, in particular to a method for constructing a web crawler based on deduplication of news.

Background technique

[0002] In this era of information explosion, Internet media, with its rapid news release and wide dissemination, has gradually replaced traditional media such as TV and newspapers and has become the mainstream channel of news dissemination.

[0003] The current major news portals, "Sina.com", "Xinhuanet", and "NetEase", all have their own strong teams for news gathering, editing, and publishing, and the number of news items released reaches thousands every day. News websites generally cover various categories of news: domestic news, international news, social news, entertainment news, military news, sports news, financial news, technology news, etc. At the same time, each news portal also has its own characteristics, such as "Xinhuanet" for current affairs news, "Sina.com" f...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventors: 卜佳俊, 李辉, 陈伟, 陈纯, 梁雄君

Owner: ZHEJIANG UNIV
