154 patent results for "Duplicate content"

Duplicate content is a search engine optimization (SEO) term for content that appears on more than one web page. The duplication may cover substantial parts of a page, may occur within a single domain or across domains, and may be exact or near-identical. When multiple pages contain essentially the same content, search engines such as Google and Bing can penalize the copying site or stop displaying it in relevant search results.

Chinese web page text deduplication system and method

The invention discloses a system and a method for deduplicating Chinese web page text. The system comprises an index server and a search server. The index server comprises a web page text preprocessing module, a combined characteristic sentence extraction module, and a digital signature calculation module; the search server comprises a web page text capture module and a hash query module. The method comprises the following steps: normalizing the web page text; extracting a combined characteristic sentence from the text; calculating a digital signature of the combined characteristic sentence; and comparing that signature with the existing signatures in a hash table to judge whether the page is a duplicate. With this system and method, a search engine can quickly and accurately identify and remove large numbers of Chinese web pages with duplicated content on the Internet. When the search engine captures a new web page, it calculates the page's digital signature and compares it with the signatures of pages it has already stored; if the page is a duplicate, it is not stored, which avoids wasting storage space and also improves search accuracy.
Owner: SHENGLE INFORMATION TECH SHANGHAI
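
The four steps in the abstract above (normalize, extract a combined characteristic sentence, compute a digital signature, look it up in a hash table) can be sketched roughly as follows. This is only an illustration, not the patented method: the abstract does not say how the characteristic sentence is chosen or which signature function is used, so picking the n longest sentences and hashing them with SHA-256 are assumptions, and every name here (normalize, characteristic_sentence, signature, SignatureIndex) is hypothetical.

```python
import hashlib
import re
import unicodedata

# Sentence terminators after NFKC normalization (assumed set, not from the patent).
_SENTENCE_END = re.compile(r"[。!?;\n]+")


def normalize(text: str) -> str:
    """Strip HTML-like tags, fold full-width forms, and drop in-line whitespace."""
    text = re.sub(r"<[^>]+>", "", text)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"[ \t\r\f\v]+", "", text)


def characteristic_sentence(text: str, n: int = 3) -> str:
    """Assumed heuristic: join the n longest sentences in document order."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    longest = sorted(range(len(sentences)), key=lambda i: (-len(sentences[i]), i))[:n]
    return "".join(sentences[i] for i in sorted(longest))


def signature(text: str, n: int = 3) -> str:
    """Digital signature of the combined characteristic sentence (SHA-256 assumed)."""
    combined = characteristic_sentence(normalize(text), n)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


class SignatureIndex:
    """Hash table of signatures for pages that have already been stored."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def add_if_new(self, url: str, text: str) -> bool:
        """Store the page and return True, or return False if it is a duplicate."""
        sig = signature(text)
        if sig in self._seen:
            return False
        self._seen[sig] = url
        return True


index = SignatureIndex()
page_a = "<p>搜索引擎抓取网页。  网页去重可以节省存储空间！</p>"
page_b = "搜索引擎抓取网页。网页去重可以节省存储空间！"  # same text, different markup
page_c = "这是一个完全不同的页面。"

first = index.add_if_new("http://example.com/a", page_a)
second = index.add_if_new("http://example.com/b", page_b)
third = index.add_if_new("http://example.com/c", page_c)
```

Because the signature is an exact hash, this sketch only catches pages whose selected sentences match character for character after normalization; detecting looser near-duplicates would need a different signature scheme.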