Unstructured text similarity judgment method and system

An unstructured text similarity technique, applied in the field of data processing, that addresses problems such as word frequency receiving excessive weight or dominating the result and text semantics being ignored, and that achieves higher measurement accuracy and strong robustness.

Pending Publication Date: 2020-12-18

STATE GRID LIAONING ELECTRIC POWER RES INST +3

Cites: 2; Cited by: 0


AI Technical Summary

Problems solved by technology

Some improvements have also been made to the final similarity score, but they significantly increase time complexity.

[0009] This scheme is relatively crude and has obvious shortcomings. First, it does not consider the semantics of documents: it ignores the contextual and positional relationships between words and judges similarity only at the level of string comparison.

[0012] The disadvantage of applying this algorithm to text similarity calculation is that it uses the product of TF and IDF as the coordinate value in the feature space, with IDF acting as a weighting adjustment on the term frequency TF. The purpose of the adjustment is to highlight important words and suppress secondary ones.

However, IDF computes the weight of a feature item against the total number of documents in the document set. When the document set is unbalanced across document types, for example when one type has relatively few documents, IDF has almost no suppressing effect.

TF-IDF then fails to balance TF against IDF. The weight of the feature item depends almost entirely on the term frequency TF, so word frequency receives too high a weight or even dominates the result.

In that case the measure of word importance is clearly flawed, because important words do not necessarily occur many times.
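To make the weighting concrete, the following is a minimal sketch of TF-IDF, assuming the common textbook definitions tf = count / document length and idf = log(N / df). The source does not say which variant it has in mind, and the toy corpus and the function name `tf_idf` are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical three-document corpus.
docs = [
    "the grid fault report the grid".split(),
    "the transformer fault report".split(),
    "the weather report".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "the" occurs in every document, so its IDF is log(1) = 0 and it is fully suppressed. "grid" occurs in one document only, so its final weight scales directly with how often it is repeated there, which is the dependence on term frequency described above.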

Most importantly, this algorithm cannot reflect the position or context of words and ignores the semantics of the text: a word that appears early in the text and a word that appears late are treated as equally important.
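The loss of position information can be illustrated with a small sketch. It assumes plain whitespace tokenization and raw term counts rather than any particular weighting from the source. The function name `cosine_bow` and the example sentences are hypothetical.

```python
import math
from collections import Counter

def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    sq_a = sum(c * c for c in va.values())
    sq_b = sum(c * c for c in vb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0

# Same words, opposite meaning: who tripped what is reversed.
s1 = "the relay tripped the breaker"
s2 = "the breaker tripped the relay"
score = cosine_bow(s1, s2)
```

Both sentences produce identical count vectors, so the score is 1 (up to floating-point rounding) even though the roles of "relay" and "breaker" are swapped.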

[0013] minHash is an implementation of Locality-Sensitive Hashing (LSH) that is used to estimate the similarity of two sets, but compared with other algorithms such as simhash it lacks a description of the degree of similarity (simhash uses the Hamming distance to measure similarity).
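As a rough illustration of the two measures mentioned here, the sketch below estimates the Jaccard similarity of two token sets with MinHash and computes the Hamming distance between two integer fingerprints, which is the measure simhash relies on. The seeded BLAKE2b hashing, the signature length of 128, and all function names are assumptions made for this example, not details from the source.

```python
import hashlib

def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over a non-empty token set."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two integer fingerprints."""
    return bin(fp_a ^ fp_b).count("1")

a = set("the relay tripped the main breaker".split())
b = set("the relay tripped the backup breaker".split())
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
exact = len(a & b) / len(a | b)  # 4 shared tokens out of 6 distinct ones
```

The estimate converges on the exact Jaccard value as `num_hashes` grows. `hamming_distance` would be applied to simhash fingerprints, which this sketch does not compute.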


Embodiment Construction

[0065] As shown in the figure, the present invention comprises the following steps:

[0066] 1. Input unstructured data

[0067] The unstructured data may be web pages, Word documents, or the like obtained from a web crawler.

[0068] 2. Text extraction

[0069] Extract textual information from the unstructured data. This step uses Apache Tika (an open-source text extraction component from the Apache Software Foundation) to extract text content; it is compatible with many formats, such as Excel, PDF, XML, JSON, and Markdown. The step outputs the extracted text as a txt file.
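The extraction step could look roughly like the sketch below. It assumes the third-party `tika` Python client (tika-python), which needs Java and a Tika server at run time. The source names Apache Tika but not the client or language used. The helper names `tika_extract` and `extract_to_txt`, and the output layout, are hypothetical.

```python
from pathlib import Path

def tika_extract(path):
    """Extract text with the tika-python client (needs Java and a Tika server)."""
    from tika import parser  # imported lazily so the helper below runs without Tika
    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""  # "content" can be None for empty files

def extract_to_txt(src, out_dir, extractor=tika_extract):
    """Run the extractor on src and write the result to out_dir/<stem>.txt."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (src.stem + ".txt")
    out_path.write_text(extractor(src), encoding="utf-8")
    return out_path
```

Passing the extractor as a parameter keeps the file handling testable without a running Tika server. A call such as `extract_to_txt("report.pdf", "out")` would write `out/report.txt`.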

[0070] 3. Preprocessing

[0071] This step applies a series of preprocessing operations to the txt file obtained in the previous step, including removing HTML tags, removing garbled characters, removing special characters, and normalizing punctuation. It outputs usable plain text.
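The preprocessing operations listed above can be sketched with the Python standard library. The exact rules are not given in the source, so the choices here are assumptions: "garbled characters" are treated as U+FFFD replacement characters, "special characters" as Unicode control and format characters, and punctuation normalization as a small full-width-to-ASCII mapping. `PUNCT_MAP` and `preprocess` are hypothetical names.

```python
import html
import re
import unicodedata

# Assumed mapping from full-width to ASCII punctuation; extend as needed.
PUNCT_MAP = str.maketrans({"，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":"})

def preprocess(text):
    """Strip HTML tags, garbled and control characters, and normalize punctuation."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)  # drop script/style bodies
    text = re.sub(r"<[^>]+>", " ", text)   # drop remaining tags
    text = html.unescape(text)             # decode entities such as &nbsp;
    text = text.replace("\ufffd", "")      # U+FFFD marks undecodable bytes
    # Remove control/format characters (Unicode categories C*) but keep whitespace.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

raw = "<p>Fault&nbsp;report\ufffd：<b>relay</b> tripped！</p>"
clean = preprocess(raw)
```

The regular-expression tag stripping is adequate for a sketch; heavily malformed HTML would call for a real HTML parser.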

[0072] 4. Training sentiment classification model

[0073] Step A: P...


Abstract

The invention belongs to the technical field of data processing and relates in particular to a method and system for judging the similarity of unstructured text. The scheme is implemented in the following steps: 1. Input unstructured data, which may be web pages, Word documents, or the like obtained from a web crawler. 2. Extract text: textual information is extracted from the unstructured data using Apache Tika (an open-source text extraction component from the Apache Software Foundation), which is compatible with many formats, such as Excel, PDF, XML, JSON, and Markdown. This step outputs the extracted txt file.

Description

Technical field

[0001] The invention belongs to the technical field of data processing and relates in particular to a method and system for judging the similarity of unstructured texts.

Background

[0002] Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent in the two-dimensional logical tables of a database. It includes web page text, office documents in all formats, text, pictures, XML, HTML, various reports, images, and audio/video information.

[0003] Text similarity measurement is the measurement of how similar two texts are, and it has a wide range of applications in many fields. For example, in information retrieval, similarity can be used to identify similar words and improve recall. In automatic question answering, similarity can be used to calculate the matching degree between the user's question sentenc...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F40/194, G06F40/30, G06F40/289, G06F16/33, G06F16/35, G06K9/62, G06N3/04, G06N3/08

CPC: G06F40/194, G06F40/30, G06F40/289, G06F16/3335, G06F16/3344, G06F16/35, G06N3/049, G06N3/08, G06N3/045, G06F18/241

Inventor: 胡博, 李钊, 李伟, 雷振江, 田小蕾, 王丽霞, 王大维, 杨超, 张智儒, 王义贺, 周小明, 王磊, 李广翱, 庄莉, 梁懿, 陈新梅, 曹国强

Owner: STATE GRID LIAONING ELECTRIC POWER RES INST
