Method for analyzing similarity of character string under Web environment

A string similarity analysis technique, applied in electrical digital data processing and special data processing applications, that addresses wrongly inflated similarity scores and the resulting distortion, achieving high applicability and high matching accuracy.

Inactive Publication Date: 2009-10-21
NORTHEASTERN UNIV

AI Technical Summary

Problems solved by technology

However, this algorithm is only suitable for cases where abbreviations are common. When it compares strings that are not abbreviations, the similarity between the two words is often wrongly inflated, which distorts the result.

Examples

Embodiment Construction

[0045] An embodiment of the present invention uses two representations of the author of the book "Thinking in Java, Fourth Edition", both extracted from the Deep Web, as follows:

[0046] X = "(US) Bruce Eckel"; Y = "Eckel (US)".

[0047] The specific implementation steps of the method are as follows:

[0048] Step 1. Define the basic operation costs: C(a->a)=0 (match), C(a->b)=1 (substitution), C(a->ε)=1 (deletion), C(ε->a)=1 (insertion).
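The four costs in Step 1 can be written as a single function. This is a minimal sketch; the names `EPSILON` and `op_cost` are illustrative and do not come from the patent text.

```python
EPSILON = ""  # illustrative stand-in for the empty symbol ε


def op_cost(src: str, dst: str) -> int:
    """Cost of one basic edit operation src -> dst, following Step 1."""
    if src == dst:
        return 0  # C(a->a) = 0: the characters match
    # C(a->b), C(a->ε) and C(ε->a) all cost 1
    return 1


print(op_cost("a", "a"), op_cost("a", "b"), op_cost("a", EPSILON), op_cost(EPSILON, "a"))
```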

[0049] Step 2. Preprocess the input strings X and Y. First, identify the first character of each word in the two strings: ("美", "B", "E") for X and ("E", "美") for Y, where 美 is the nationality marker rendered as "US" in [0046]. Then remove the characters that carry no meaning from X and Y. Consistent with the matching path in Step 3, this gives X = "美BruceEckel" and Y = "Eckel美".
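A minimal sketch of the preprocessing in Step 2 follows. It assumes that "characters without meaning" are whitespace, punctuation, and underscores, and that words are maximal runs of the remaining characters; the patent excerpt does not spell out either rule, and the function name `preprocess` is hypothetical.

```python
import re


def preprocess(s: str):
    """Return (first character of each word, string with non-word characters removed).

    Assumption: a "word" is a maximal run of letters or digits (any script).
    """
    words = re.findall(r"[^\W_]+", s)
    initials = [w[0] for w in words]
    cleaned = "".join(words)
    return initials, cleaned


# 美 is the original nationality marker that the translation renders as "US".
print(preprocess("(美) Bruce Eckel"))
print(preprocess("Eckel (美)"))
```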

[0050] Step 3. Create a distance matrix with X as the rows and Y as the columns, as shown in Figure 1. From the distance matrix, a matching path can be obtained: {(美->ε), (B->ε), (r->ε), (u->ε), (c->ε), (e->ε),...
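The matrix and path in Step 3 can be sketched with a standard Levenshtein dynamic program and a backtrace. This is not the patent's optimized method, only the plain edit-distance baseline under the Step 1 costs. The function names are illustrative, and the input strings assume that preprocessing yields "美BruceEckel" and "Eckel美".

```python
EPS = "ε"  # marks a deletion (a->ε) or insertion (ε->a) in the path


def distance_matrix(x: str, y: str):
    """Edit-distance matrix with x on the rows and y on the columns."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i  # delete the first i characters of x
    for j in range(1, cols):
        d[0][j] = j  # insert the first j characters of y
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d


def matching_path(x: str, y: str, d):
    """Walk back from the bottom-right cell and collect the edit operations."""
    i, j = len(x), len(y)
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1):
            path.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            path.append((x[i - 1], EPS))
            i -= 1
        else:
            path.append((EPS, y[j - 1]))
            j -= 1
    path.reverse()
    return path


X, Y = "美BruceEckel", "Eckel美"
D = distance_matrix(X, Y)
PATH = matching_path(X, Y, D)
print(D[-1][-1])
print(PATH)
```

Under these assumptions the path starts with the six deletions listed in [0050]. The remaining operations are what this baseline backtrace produces and are not taken from the truncated source text.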

Abstract

The invention discloses a method for analyzing the similarity of character strings in a Web environment, comprising the following steps. First, the basic operation costs are defined. The character strings are then preprocessed to identify the first character of each word and to remove characters that carry no real meaning. A distance matrix is built, and the edit distance is optimized by building a match index. Next, abbreviations are judged: the method first decides whether the two strings have an abbreviation relationship, and if they do, the distance is optimized. The abbreviation relationship is determined by two factors: (1) whether the two strings are similar; and (2) whether the first characters of the two strings have been matched. Finally, the distance for abbreviated words is optimized by reducing the cost of consecutively inserted and consecutively deleted characters. The method handles omission, abbreviation, and character disorder, all of which are common on the Web, is widely applicable, and achieves higher matching precision in an unknown Web environment.
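The abstract does not state how the cost of consecutive insertions and deletions is reduced. The sketch below shows one possible reading, in which the first operation of a run keeps its full cost and each further operation in the same run is charged a smaller amount. The function name `discounted_cost`, the `run_cost` parameter, and its default of 0.5 are all assumptions for illustration. The example path is the plain Levenshtein alignment of "美BruceEckel" and "Eckel美", which is likewise assumed rather than quoted from the patent.

```python
EPS = "ε"  # empty symbol in a (source, target) edit operation


def discounted_cost(path, run_cost=0.5):
    """Sum operation costs, charging run_cost for each deletion or insertion
    that directly follows an operation of the same kind (hypothetical rule)."""
    total = 0.0
    prev_kind = None
    for src, dst in path:
        if src == dst:
            kind, cost = "match", 0.0
        elif dst == EPS:
            kind, cost = "delete", 1.0
        elif src == EPS:
            kind, cost = "insert", 1.0
        else:
            kind, cost = "substitute", 1.0
        if kind in ("delete", "insert") and kind == prev_kind:
            cost = run_cost  # continuation of a run is cheaper
        total += cost
        prev_kind = kind
    return total


# Six deletions, five matches, one insertion.
example_path = ([(c, EPS) for c in "美Bruce"]
                + [(c, c) for c in "Eckel"]
                + [(EPS, "美")])
print(discounted_cost(example_path, run_cost=1.0))  # no discount: plain edit distance
print(discounted_cost(example_path))                # with the assumed discount
```

With `run_cost=1.0` the function reduces to the ordinary edit distance, which makes it easy to compare the discounted and undiscounted scores for the same path.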

Description

Technical field

[0001] The invention belongs to the field of computer Web databases, and is particularly suitable for judging the similarity of two records during duplicate record recognition in a Web database integration system.

Background technique

[0002] In the Web environment, strings that need to be matched by similarity often contain spelling mistakes, reversed keyword order, abbreviations, or omitted words, which creates many difficulties for string similarity analysis methods applied in this environment. The commonly used string similarity analysis methods usually each target only one specific situation. For example, Levenshtein distance is better suited to spelling mistakes, and the Jaro distance metric is better suited to recognizing abbreviations or omitted words. In applications, it is often necessary to judge manually which algorithm to use in which environment. However, there are mostly semi-structured and unstructu...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30, G06F17/22
Inventor: 于戈, 申德荣, 朱命冬, 寇月, 聂铁铮, 王振华
Owner: NORTHEASTERN UNIV