Method and system for calculating new words in text based on word frequency matrix feature vectors

A technique based on the eigenvectors of a word frequency matrix, applied to discovering new words in text. It addresses the low accuracy, low efficiency, and high cost of prior approaches and achieves high accuracy and computational efficiency.

Pending Publication Date: 2020-12-18
北京工联科技有限公司

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a method and system for calculating new words in text based on word frequency matrix eigenvectors, so as to solve the problems of high overhead, low efficiency, and low accuracy in the prior art.

Method used



Examples


Detailed Description of the Embodiments

[0051] The technical solutions of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0052] Because the method of the present invention for calculating new words in a text based on the eigenvectors of the word frequency matrix can be distributed and parallelized on a large scale, new words in more than 1 million documents can be mined within one hour. The following takes one of the documents as an example to show how the invention is implemented.

[0053] S1: Calculation of the word frequency dictionary of the text set

[0054] Figure 2 shows a screenshot of a piece of online news, in which some Internet buzzwords (new words) are marked with boxes.

[0055] First, the text is preprocessed: every punctuation mark in the article is uniformly replaced with "|", as shown in Figure 3.
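
The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the excerpt does not list which characters count as punctuation, so the sketch treats every character that is neither a word character nor whitespace as punctuation, and the name `replace_punctuation` is hypothetical.

```python
import re

# Assumption: "punctuation" is any character that is not a Unicode word
# character and not whitespace. The source does not give the exact set.
# Note that "_" counts as a word character here and is left untouched.
_PUNCT = re.compile(r"[^\w\s]")


def replace_punctuation(text: str, sep: str = "|") -> str:
    """Replace every punctuation mark in `text` with the separator `sep`."""
    return _PUNCT.sub(sep, text)


sample = "今天天气不错，我们去公园。"
cleaned = replace_punctuation(sample)
```

Each "|" then marks a boundary that a candidate word cannot span, which is presumably why the marks are replaced rather than simply deleted.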

[0056] Use a conventional word segmentation method to segment the text only, and count the f...


Abstract

The invention relates to a method and system for calculating new words in a text based on a word frequency matrix feature vector. The method mainly comprises the following steps: S1, calculating a word frequency dictionary of a text set; S2, initializing a word frequency matrix; S3, performing dimension reduction based on principal component analysis; S4, performing new word discovery. The system mainly comprises the following modules: a calculation module for the word frequency dictionary of a text set; an initialization module for the word frequency matrix; a dimension reduction module based on principal component analysis; and a new word discovery module. According to the method and system for calculating the new words in the text based on the word frequency matrix feature vectors, the new words in the text can be mined with high accuracy and calculation efficiency.
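
To make steps S1 to S3 more concrete, here is a minimal Python sketch. It is not the patented implementation: this excerpt names the steps but does not say how the word frequency matrix is laid out or how candidates are scored. The sketch therefore assumes a matrix whose entry (i, j) counts how often token i is immediately followed by token j, and it approximates the first principal component with plain power iteration instead of a full eigendecomposition. All function names are hypothetical, the input must be non-empty, and S4 is left out because the excerpt gives no scoring rule for it.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List

Segments = List[List[str]]  # one token list per "|"-delimited span


def word_frequency_dict(segments: Segments) -> Counter:
    """S1: count how often each token occurs in the text set."""
    return Counter(tok for seg in segments for tok in seg)


def word_frequency_matrix(segments: Segments, vocab: List[str]) -> List[List[float]]:
    """S2 (assumed layout): m[i][j] = times token i is directly followed by token j."""
    index: Dict[str, int] = {w: i for i, w in enumerate(vocab)}
    m = [[0.0] * len(vocab) for _ in vocab]
    for seg in segments:
        for left, right in zip(seg, seg[1:]):
            m[index[left]][index[right]] += 1.0
    return m


def center_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Subtract each column's mean, as PCA requires."""
    rows, cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / rows for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in matrix]


def first_principal_component(centered: List[List[float]], iters: int = 200) -> List[float]:
    """S3: approximate the leading eigenvector of X^T X by power iteration."""
    cols = len(centered[0])
    v = [1.0 + j for j in range(cols)]  # arbitrary non-zero start vector
    for _ in range(iters):
        xv = [sum(a * b for a, b in zip(row, v)) for row in centered]  # X v
        w = [sum(row[j] * s for row, s in zip(centered, xv)) for j in range(cols)]  # X^T (X v)
        norm = sqrt(sum(c * c for c in w))
        if norm == 0.0:  # degenerate case: no variance along v
            return [0.0] * cols
        v = [c / norm for c in w]
    return v


def project(centered: List[List[float]], component: List[float]) -> List[float]:
    """Reduce each row (token) to its coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Toy input: three already-segmented spans with placeholder tokens.
segments = [["a", "b", "c"], ["a", "b", "d"], ["c", "a", "b"]]
freq = word_frequency_dict(segments)
vocab = sorted(freq)
matrix = word_frequency_matrix(segments, vocab)
centered = center_columns(matrix)
pc1 = first_principal_component(centered)
scores = project(centered, pc1)
```

Every span contributes to the counts independently, so the S1 and S2 counts from different documents could be computed separately and summed afterwards.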

Description

Technical field

[0001] The invention relates to a method and a system for calculating new words in a text based on feature vectors of a word frequency matrix, and belongs to the technical fields of natural language processing, data mining, and Chinese word segmentation.

Background art

[0002] In the field of Chinese information processing, automatic Chinese word segmentation is a very important basic task. With the continuous development of society, however, new words keep emerging in daily life. New words cause too many "scattered strings" to appear in the results of automatic word segmentation, which reduces segmentation accuracy, for example Wei / Yingluo and Bullet / SMS. According to research, 60% of word segmentation errors are caused by new words. Therefore, effectively identifying new words will play an important role in observing and analyzing the dynamic changes of language phenomena, standardizing language and characters, a...

Claims


Application Information

IPC(8): G06F40/289, G06F40/284, G06F40/242, G06F40/216
CPC: G06F40/289, G06F40/284, G06F40/242, G06F40/216
Inventor: 朱国伟, 顾维玺, 吕衎, 马戈, 王青春, 黄启洋
Owner: 北京工联科技有限公司