A method for identifying the similarity of large amounts of Web text information based on a word net

A text information similarity technology, applied in character and pattern recognition, digital data information retrieval, network data retrieval, etc. It addresses the problems that existing methods do not take into account the ambiguity of end-user queries, that the target result of a query request is unclear, and that the content of a query request is not targeted. Its effects are to improve retrieval efficiency and the quality of retrieval results, optimize storage and index structures, and eliminate content plagiarism.

Status: Inactive; Publication Date: 2021-12-17
SICHUAN NORMAL UNIV
13 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0006] The above approach determines text similarity from the literal content of the text. It can return the basic results required by a query request, or compare texts by their literal content, but it has two shortcomings: (1) it does not account for the ambiguity in how end users phrase queries, that is, the target result that a query request should return is unclear and the content of the query request is not targeted, so the returned results may not be what the user expects; (2) it cannot identify two documents whose literal forms differ greatly but whose information or meaning is almost the same, being merely described with different words from different angles, or even being synonymous paraphrases of the same problem.
Haveliwala et al. pointed out that identifying related web pages from the link relationships between pages works poorly when the number of links is small, and proposed combining anchor text with anchor windows to compensate for sparse links. This approach is easily influenced by the number of links between pages and by the type or quality of the pages.


Examples


Embodiment Construction

[0057] The present invention is further described below with reference to the accompanying drawings:

[0058] The method of the present invention for identifying the similarity of large amounts of Web text information based on a word net comprises the following steps:

[0059] (1) Build the word net, comprising the following steps:

[0060] 1.1. Extract text information from Web pages to form a document set D composed of multiple documents d. Extract feature words from each document d in the document set D and, for any two feature words f_i and f_j, calculate the normalized mutual information values norm_I_ij and norm_I_ji. From the calculated norm_I_ij and norm_I_ji, build the mutual information relation word pairs <f_i, f_j> and <f_j, f_i> between the feature words f_i and f_j, with norm_I_ij as the weight of the word pair <f_i, f_j> and norm_I_ji as the weight of the word pair <f_j, f_i>. When norm_I_ij = nor...
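Step 1.1 can be illustrated with a minimal sketch. The source text is truncated before it defines how norm_I is normalized, so the formula below is an assumption made only for illustration: the pointwise mutual information of two feature words (estimated from document-level co-occurrence) divided by the self-information of the source word, which makes the two directions differ. The function name `build_word_net` and the dictionary layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_word_net(documents):
    """Build a directed, weighted word net from feature-word lists.

    documents: list of lists of feature words, one list per document d.
    Returns a dict mapping the word pair (f_i, f_j) to a weight.

    ASSUMPTION: the weight is PMI(f_i, f_j) / -log p(f_i). The patent's
    actual norm_I formula is not visible in the truncated source.
    """
    n_docs = len(documents)
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words
    for words in documents:
        unique = sorted(set(words))
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    net = {}
    for (fi, fj), n_ij in pair_freq.items():
        p_i = doc_freq[fi] / n_docs
        p_j = doc_freq[fj] / n_docs
        p_ij = n_ij / n_docs
        pmi = math.log(p_ij / (p_i * p_j))
        if pmi <= 0:
            continue  # keep only positively associated word pairs
        # pmi > 0 implies p_i < 1 and p_j < 1, so both divisors are positive.
        net[(fi, fj)] = pmi / -math.log(p_i)
        net[(fj, fi)] = pmi / -math.log(p_j)
    return net
```

Storing each direction under its own key mirrors the two word pairs <f_i, f_j> and <f_j, f_i> described above, so a rare word can point strongly at a common one without the reverse being true.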


Abstract

The invention discloses a method for identifying the similarity of large amounts of Web text information based on a word net, comprising the following steps: (1) construct a word net; (2) identify the similarity of the text information of a new Web page, comprising the following steps: extract the text information to form a new document, and extract the feature words f_1, f_2, ..., f_m from the new document; solve the set of similar words of each feature word f; solve the set of similar documents of each feature word f; determine the similar documents of the new document and calculate the similarity value of each document in the similar document set; filter the documents in the similar document set to obtain the final set of similar documents; (3) update the word net with the new Web page according to the method of step (1). The method can be used to discover information plagiarism, imitation, and tampering, to discover hidden correlations between different fields, to eliminate duplicate web pages, to reduce the burden on search engines, and to optimize storage and index structures.
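The flow of step (2) can be sketched as follows. This is an illustrative outline only: the abstract does not give the similarity formula or the filtering rule, so the scoring (average of the best word-net weight per feature word), the thresholds, and the names `similar_words`, `find_similar_documents`, `net`, and `index` are all assumptions. `net` is taken to be a dict of word-pair weights and `index` an inverted index from feature word to document identifiers.

```python
from collections import defaultdict


def similar_words(net, word, min_weight=0.5):
    """Return the word itself plus words it points to in the word net.

    ASSUMPTION: a word counts as similar when the weight of the word
    pair (word, other) is at least min_weight.
    """
    related = {word: 1.0}
    for (src, dst), weight in net.items():
        if src == word and weight >= min_weight:
            related[dst] = weight
    return related


def find_similar_documents(net, index, new_features, threshold=0.5):
    """Score stored documents against the feature words of a new document.

    net:   dict mapping (f_i, f_j) to a weight.
    index: dict mapping a feature word to the set of document ids using it.
    ASSUMPTION: similarity is the mean, over the new document's feature
    words, of the best weight linking that word to the stored document.
    """
    features = set(new_features)
    if not features:
        return {}
    scores = defaultdict(float)
    for f in features:
        best = {}  # best weight per stored document for this feature word
        for word, weight in similar_words(net, f).items():
            for doc_id in index.get(word, ()):
                best[doc_id] = max(best.get(doc_id, 0.0), weight)
        for doc_id, weight in best.items():
            scores[doc_id] += weight
    m = len(features)
    # Filter step: keep only documents at or above the threshold.
    return {d: s / m for d, s in scores.items() if s / m >= threshold}
```

Because matching goes through the similar-word sets rather than exact word overlap, a stored document can score well even when it shares no literal feature word with the new document, which is the behaviour the method is aiming for.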

Description

Technical field

[0001] The invention relates to a method for identifying the similarity of Internet text information, in particular to a method for identifying the similarity of large amounts of Web text information based on a word net.

Background technique

[0002] The evolution of Internet technology not only delivers information and knowledge, but also gives Internet users a platform to publish information and communicate with each other. The participation of ordinary users has driven rapid growth in the amount of online information, making the Internet one of the important information resource libraries.

[0003] In order to cope with the rapid growth of Internet information, many research projects focus on how to organize these large amounts of information effectively, so that end users can quickly and accurately obtain the information they need and the cost of organizing information is reduced. Web information in the Internet is displ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/332, G06F16/953, G06K9/62
CPC: G06F18/22
Inventors: 靳宇倡, 安俊秀, 文仁强
Owner: SICHUAN NORMAL UNIV