
Semantic relevancy calculation method of document

A method for calculating the semantic relevance of documents. It addresses problems such as loss of semantic information, reduced output accuracy, and narrow coverage, and aims to provide a simple calculation method, preserve accuracy, and improve adaptability.

Active Publication Date: 2016-01-27
SHENZHEN GIISO INFORMATION TECH
6 Cites, 18 Cited by

AI Technical Summary

Problems solved by technology

Building dictionaries and domain knowledge bases takes considerable time and manpower. More importantly, dictionaries cover a narrow range of information, including only the vocabulary of specific domains, and knowledge bases are even less complete.
Relevance calculation methods based on search engines rely on the results returned by external search engines, so they cannot guarantee stable system output and do not support offline calculation over large text collections.
[0012] Second, although the traditional bag-of-words method is simple in principle and implementation, it performs poorly on short texts and on texts that contain many synonyms or polysemous words.
Methods that learn the topic distribution of vocabulary are limited by their dependence on a corpus, that is, a set of documents from a domain similar to that of the texts whose semantic relevance is being examined.
In real application scenarios, such a corpus is usually not readily available, or it takes manual effort to collect and organize.
[0014] Fourth, as the number of Wikipedia concept articles grows, explicit semantic analysis maps more and more distinct concepts for a given topic, and these concepts appear together in the representation vector of a text. Because explicit semantic analysis does not consider the similarity between concept articles, two related concept articles enter the semantic relevance calculation as separate vector elements.
As a result, this semantic information is lost and the accuracy of the system output decreases.
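
The loss described in [0014] can be illustrated with a minimal sketch. The concept titles and weights below are hypothetical and are not taken from the patent; the point is only that cosine similarity over concept vectors treats every concept article as an independent dimension.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical concept vectors for two short texts on the same topic.
# The concept titles are related, but none of them is shared.
text_a = {"Association football": 0.9, "Goalkeeper": 0.4}
text_b = {"FIFA World Cup": 0.8, "Penalty shoot-out": 0.5}

# With no shared concept titles the dot product is zero, so the score is 0.0
# even though a human reader would call the two texts related.
score = cosine(text_a, text_b)
print(score)
```

Any similarity between the concept articles themselves never enters the calculation, which is the information loss the paragraph refers to.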


Examples


Specific Embodiment 1

[0089] 1) Create the database. The main data tables are shown in Table 1.

[0090] Table 1: Main data tables of the system

[0091] Table Name

[0092] 2) Establish a mapping from words to Wiki concept vectors and store it in the data tables. Obtain a recent version of the Wikipedia database backup, extract the full text and the category link information of the concept articles from the dump XML file, and store them in the wikipage and categorylinks tables, respectively.

[0093] 3) Apply stemming (for English) or word segmentation (for Chinese; the segmenter used is ICTCLAS, the Chinese Academy of Sciences automatic word segmentation system) to the text of the Wiki concept articles, build an inverted index with the full-text indexing tool Lucene, and store it in the termindex data table. At this point, every word that appears in a Wikipedia concept article has a corresponding representation vector, the elements of which are compo...
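
The inverted-index step in [0093] can be sketched in plain Python. This is a stand-in under stated assumptions, not the patent's implementation: the source uses Lucene and ICTCLAS, while the sketch uses whitespace tokenization, and the TF-IDF weighting is an assumption because the source text is truncated before it says how the vector elements are formed. The function name `build_term_index` and the sample articles are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_term_index(articles):
    """Invert {concept_title: text} into {term: {concept_title: weight}}.

    Assumption: weights are TF-IDF. Whitespace splitting stands in for
    stemming (English) or word segmentation (Chinese).
    """
    term_counts = {title: Counter(text.lower().split())
                   for title, text in articles.items()}

    # Document frequency: in how many concept articles each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_articles = len(articles)
    index = defaultdict(dict)
    for title, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_articles / doc_freq[term])
            if idf > 0:  # terms found in every article carry no signal
                index[term][title] = tf * idf
    return dict(index)

# Hypothetical miniature "concept articles".
articles = {
    "Cat": "cat feline pet animal",
    "Dog": "dog canine pet animal",
    "Car": "car vehicle engine",
}
term_index = build_term_index(articles)
print(sorted(term_index["pet"]))  # concept titles whose text contains "pet"
```

Each entry of `term_index` plays the role of one row of the termindex table: a word together with its vector over concept articles.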


Abstract

The invention provides a method for calculating the semantic relevancy of documents. The method comprises the following steps: preprocessing the data; establishing a mapping from terms to Wiki concept vectors in a relational database; inputting a first text and a second text whose semantic relevancy is to be calculated, and retrieving the Wiki concept vectors corresponding to all terms in each text; constructing a hierarchical Wiki catalogue; mapping the Wiki concept vectors of each text onto the Wiki catalogue to construct Wiki catalogue vectors; and calculating the semantic relevancy of the first text and the second text from the Wiki catalogue vectors. The method is built on a framework that computes semantic relevancy from both Wiki concepts and the Wiki catalogue. It considers semantic relevancy at different levels of abstraction and combines them to improve calculation precision, and it also provides a sound human-machine interaction mechanism and scheduling policy.
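
The steps listed in the abstract can be sketched end to end as follows. This is a rough illustration under several assumptions that the abstract does not confirm: the catalogue is flattened to a single level, the concept vectors of a text are combined by summation, and the two abstraction levels are combined by a linear weight `alpha`. All names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def concept_vector(text, term_index):
    """Sum the concept vectors of all terms in the text (assumed combination)."""
    vec = {}
    for term in text.lower().split():
        for concept, weight in term_index.get(term, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec

def catalogue_vector(concept_vec, concept_categories):
    """Project a concept vector onto the categories its concepts belong to."""
    vec = {}
    for concept, weight in concept_vec.items():
        for category in concept_categories.get(concept, ()):
            vec[category] = vec.get(category, 0.0) + weight
    return vec

def relevancy(text1, text2, term_index, concept_categories, alpha=0.5):
    """Blend concept-level and catalogue-level similarity (alpha is assumed)."""
    c1 = concept_vector(text1, term_index)
    c2 = concept_vector(text2, term_index)
    concept_sim = cosine(c1, c2)
    catalogue_sim = cosine(catalogue_vector(c1, concept_categories),
                           catalogue_vector(c2, concept_categories))
    return alpha * concept_sim + (1 - alpha) * catalogue_sim

# Hypothetical term-to-concept mapping and concept-to-category links.
term_index = {"kitten": {"Cat": 1.0}, "puppy": {"Dog": 1.0}, "engine": {"Car": 1.0}}
concept_categories = {"Cat": ["Pets"], "Dog": ["Pets"], "Car": ["Vehicles"]}

pets_score = relevancy("kitten", "puppy", term_index, concept_categories)
mixed_score = relevancy("kitten", "engine", term_index, concept_categories)
print(pets_score, mixed_score)
```

In this toy data "kitten" and "puppy" share no concept, yet they receive a non-zero score because their concepts fall under the same category, while "kitten" and "engine" share neither a concept nor a category.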

Description

Technical field

[0001] The invention relates to the field of information retrieval, and in particular to a method for calculating the semantic relevance of documents.

Background

[0002] With the rapid development of social media and the mobile Internet, a large volume of information resources, including text, is being generated and accumulated at an accelerating rate. Text is expressed and transmitted through natural language, which is the main carrier of human knowledge and the main medium of human communication. However, the rapid generation and mass accumulation of information make it increasingly difficult to read and process it manually. In many scenarios, such as web page retrieval and text classification, manual processing has become unrealistic. Using machines to help people process this information more quickly and efficiently has become a challenge for both academia and industry. Technologies such as information retrieval, machine...


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/24522, G06F16/284, G06F40/30
Inventor: 郑海涛, 吴文箴, 赵从志
Owner: SHENZHEN GIISO INFORMATION TECH