Duplication elimination method based on search results of metasearch engine

A meta-search engine and search result technology, applied in network data retrieval, other database retrieval, and web data retrieval using information identifiers. It addresses the problems that the same webpage under different URLs cannot be deduplicated and that redirected webpages cannot be deduplicated, and it achieves a marked deduplication effect.

Inactive Publication Date: 2016-07-27
HARBIN ENG UNIV
1 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

[0016] The purpose of the present invention is to solve the following problems in the prior art: the same webpage under different URLs cannot be deduplicated; redirected webpages cannot be deduplicated; the information around punctuation marks and the positions of punctuation marks cannot represent the information of the entire webpage; and the fuzzy matching of word frequencies considered in the prior art cannot fully represent sentences and articles. To this end, a deduplication method based on the search results of a meta-search engine is proposed.


Examples


Specific Embodiment 1

[0032] Embodiment 1: The deduplication method based on meta-search engine search results in this embodiment is carried out according to the following steps:

[0033] Step 1: Judge according to the URLs of the returned webpages (the results returned by the search engines). Unify the URL formats of two or more returned webpages, then check whether the URL addresses in the unified format are identical; if they are, the webpages are considered duplicates. The URL-based judgment covers two cases: direct comparison after URL normalization, and judgment of URL redirection. Handling both cases is considerably more comprehensive than directly comparing the raw URL addresses of the webpages.

[0034] Step 2: If Step 1 does not judge the webpages to be duplicates, go to the next step to jud...

Specific Embodiment 2

[0045] Embodiment 2: This embodiment differs from Embodiment 1 in that the direct comparison method with URL normalization in Step 1 is specifically as follows:

[0046] Direct comparison of URL addresses is very efficient, but some webpages do not use the standard URL format and some URLs omit parts that take default values. The URL formats are therefore unified first, so that both URLs include four elements: protocol name, host domain name, path, and file name. If two URLs have the same protocol name, host domain name, path, and file name, the webpages are judged to be duplicates.

[0047] If there is no file name, "/index.html" is appended, and the suffix ".htm" is converted to ".html"; for example:

[0048] www.hrbeu.edu.cn

[0049] After normalization, it is

[0050] http://www.hrbeu.edu.cn/index.html

[0051] Both point to the same webpage, so they are considered duplicates. This approach can detect many situations, such as ...
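The normalization rule in [0046] and [0047] can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names `normalize_url` and `is_duplicate_by_url` are hypothetical, "http" as the default protocol is inferred from the example in [0050], treating a final path segment without a dot as "no file name" is an assumed heuristic, and ignoring query strings and fragments is an assumption based on the four elements listed in [0046].

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Rebuild a URL as protocol://host/path/file with defaults filled in."""
    url = url.strip()
    if "://" not in url:
        # Assumption: a missing protocol defaults to http, as in [0050].
        url = "http://" + url
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path or "/"

    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        # Assumed heuristic: no dot in the last segment means no file name,
        # so "/index.html" is appended as described in [0047].
        if not path.endswith("/"):
            path += "/"
        path += "index.html"
    elif path.lower().endswith(".htm"):
        # Convert the ".htm" suffix to ".html".
        path = path[:-4] + ".html"

    # Assumption: query string and fragment are not part of the comparison.
    return f"{scheme}://{host}{path}"


def is_duplicate_by_url(url_a: str, url_b: str) -> bool:
    """Two results are URL duplicates if their normalized forms match."""
    return normalize_url(url_a) == normalize_url(url_b)


if __name__ == "__main__":
    print(normalize_url("www.hrbeu.edu.cn"))
    print(is_duplicate_by_url("www.hrbeu.edu.cn",
                              "http://www.hrbeu.edu.cn/index.html"))
```

Comparing normalized strings keeps the check as cheap as a plain string comparison while still catching the defaulted forms described above.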

Specific Embodiment 3

[0052] Embodiment 3: This embodiment differs from Embodiment 1 or 2 in that the method for judging URL redirection in Step 1 is specifically as follows:

[0053] Regarding URL redirection: some webpages are pointed to by multiple addresses within the same website, that is, the URL address has been changed and the old URL redirects to the new URL. If two webpages have the same file name and host domain name, different paths, and the same title, they are judged to be duplicates. The other steps and parameters are the same as in Embodiment 1 or 2.
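The redirection rule in [0053] can be expressed as a small predicate. The following Python sketch is an assumption-laden illustration: the names `split_url` and `is_redirect_duplicate` are hypothetical, titles are compared after trimming whitespace (the source only says the titles are "the same"), and a missing protocol is assumed to default to http.

```python
import posixpath
from urllib.parse import urlsplit


def split_url(url: str):
    """Return (host, directory path, file name) for a URL."""
    if "://" not in url:
        url = "http://" + url  # assumed default protocol
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return parts.netloc.lower(), directory, filename


def is_redirect_duplicate(url_a: str, title_a: str,
                          url_b: str, title_b: str) -> bool:
    """Apply the rule from [0053]: same host and file name,
    different path, same title."""
    host_a, dir_a, file_a = split_url(url_a)
    host_b, dir_b, file_b = split_url(url_b)
    return (
        host_a == host_b
        and file_a == file_b
        and dir_a != dir_b
        # Assumption: titles match after stripping surrounding whitespace.
        and title_a.strip() == title_b.strip()
    )


if __name__ == "__main__":
    print(is_redirect_duplicate(
        "http://example.com/old/news.html", "Campus News",
        "http://example.com/new/news.html", "Campus News",
    ))
```

This check uses only fields already present in a search result (URL and title), so it does not need to fetch the pages or follow the redirect itself.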


Abstract

The invention discloses a duplication elimination method based on the search results of a metasearch engine. The method is provided to solve the problems that, in the prior art, duplication elimination cannot be carried out on identical web pages whose URLs have different formats or on redirected web pages, and that sentences and articles cannot be fully represented by the information around punctuation marks, the positions of the punctuation marks, or fuzzy matching of word frequencies. The method comprises the following steps: 1, judging whether there are duplicated web pages according to their URL addresses; 2, computing the title similarity and the abstract similarity of the two web pages; 3, computing the similarity of the web pages from the title similarity and the abstract similarity; and 4, determining that the two web pages are duplicated if the similarity value Sim(A, B) is greater than a threshold. The method can be applied to the field of duplication elimination based on the search results of a metasearch engine.
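Steps 2 to 4 of the abstract can be illustrated with a short Python sketch. The text available here does not give the similarity formulas, the way the two scores are combined, or the threshold, so everything concrete below is an assumption: Jaccard similarity over character bigrams is a stand-in measure, not the patent's; the weighted sum with `title_weight=0.5` and the `threshold=0.8` value are hypothetical; and the function names are illustrative only.

```python
def bigrams(text: str) -> set:
    """Character bigrams of the text with whitespace removed."""
    t = "".join(text.lower().split())
    grams = {t[i:i + 2] for i in range(len(t) - 1)}
    return grams or {t}  # fall back to the whole string if too short


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (assumption): Jaccard index of bigram sets."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def page_similarity(page_a: dict, page_b: dict,
                    title_weight: float = 0.5) -> float:
    """Sim(A, B) as an assumed weighted sum of title and abstract similarity."""
    sim_title = jaccard(page_a["title"], page_b["title"])
    sim_abstract = jaccard(page_a["abstract"], page_b["abstract"])
    return title_weight * sim_title + (1 - title_weight) * sim_abstract


def is_duplicate(page_a: dict, page_b: dict, threshold: float = 0.8) -> bool:
    """Step 4: duplicates if Sim(A, B) exceeds the (hypothetical) threshold."""
    return page_similarity(page_a, page_b) > threshold


if __name__ == "__main__":
    a = {"title": "Harbin Engineering University",
         "abstract": "Official site of the university"}
    b = {"title": "Harbin Engineering University",
         "abstract": "Official site of the university"}
    print(page_similarity(a, b), is_duplicate(a, b))
```

In a full pipeline this check would run only on result pairs that the URL-based judgment of step 1 did not already mark as duplicates.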

Description

Technical field

[0001] The invention relates to a method for removing duplicates from search results, in particular to a method for removing duplicates based on meta-search engine search results.

Background

[0002] The rapid development of the Internet has made it more and more convenient for people to obtain information, but how to obtain effective information from a network with a huge amount of data has become a major problem. The emergence of search engines has effectively solved this problem. However, when people use search engines for information retrieval, they often find that many webpages are duplicated or very similar. Too many duplicate webpages affect the user's query experience, reduce the efficiency of the system, and increase the query time. How to identify duplicate webpages and remove them is a major problem, because the structure and principle of the meta search engine determines that the repetition rate in the returned results of...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/955, G06F16/24556, G06F16/9535
Inventor: 王红滨, 董宇欣, 王让, 李自金, 刘广强, 张玉鹏, 杨楠, 刘红丽, 刘天宇, 冯梦园
Owner: HARBIN ENG UNIV