Method for automatically filtering stop words

A technique for automatically filtering stop words, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the reduced accuracy and the complexity of consulting a reference stop word list, and achieves the effect of improving filtering accuracy.

Inactive Publication Date: 2012-07-11
SANDA UNIVERSITY

AI Technical Summary

Problems solved by technology

Similarly, each language evolves with the times, and the way languages borrow from one another also changes. Once an article mixes several languages, consulting a stop word list becomes very complicated.
For example, an article that introduces the relationship between Chinese and English contains both Chinese and English, and keywords are to be extracted from it. If this article is aimed...



Examples


Embodiment Construction

[0034] Referring to Figure 1, the invention discloses a method for automatically filtering stop words, which is used to filter stop words out of text. The method comprises:

[0035] S10. Preprocessing step. In the preprocessing step, the text is decomposed and categorized to compress the size of the word stock. In one embodiment, as shown in Figure 2, the preprocessing step S10 includes:

[0036] S20. Sentence decomposition step: decompose the text into sentences.

[0037] S21. Word decomposition step: further decompose each sentence into words.

[0038] S22. Part-of-speech classification step: classify Arabic numerals, English words, and punctuation marks as numbers, words, and symbols. For example, Arabic numerals and English words are classified as (num) and (word) respectively. Word strings enclosed by marks that appear in pairs are unified in the same way; for example, the strings between brackets and between book-title marks are classified as (quote) and (book).

[0039] S23. Cla...
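
Sub-steps S20 to S22 above can be sketched in Python roughly as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the regular expressions, the choice of sentence-ending marks, and the `(sym)` label for punctuation are all assumptions (the source names only `(num)`, `(word)`, `(quote)` and `(book)`). Each Chinese character is treated as its own token because the source does not say which word segmenter is used, and step S23 is not covered because its text is truncated in the source.

```python
import re
from typing import List

# Assumed sentence-ending marks; a break follows a run of such marks.
# An ASCII period only ends a sentence when followed by whitespace or
# end of text, so decimals such as 3.14 stay intact.
_SENTENCE_BREAK = re.compile(r"(?<=[。！？!?])(?![。！？!?])|(?<=\.)(?=\s|$)")
# Paired marks: book-title marks and round brackets (ASCII or full-width).
_PAIRED = re.compile(r"(?P<book>《[^》]*》)|(?P<quote>\([^)]*\)|（[^）]*）)")
# One token per placeholder, number, English word, or single character.
_TOKEN = re.compile(r"\((?:quote|book)\)|\d+(?:\.\d+)?|[A-Za-z]+|\S")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_ENGLISH = re.compile(r"[A-Za-z]+")


def split_sentences(text: str) -> List[str]:
    """S20: decompose the text into sentences."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s.strip()]


def classify(token: str) -> str:
    """S22: map a token to its generalized class label."""
    if token in ("(quote)", "(book)"):
        return token
    if _NUMBER.fullmatch(token):
        return "(num)"
    if _ENGLISH.fullmatch(token):
        return "(word)"
    if not token.isalnum():
        return "(sym)"  # hypothetical label for punctuation marks
    return token  # e.g. a Chinese character, kept as-is


def preprocess(text: str) -> List[List[str]]:
    """S20 to S22: sentences, then tokens, then generalized tokens."""
    result = []
    for sentence in split_sentences(text):
        # Collapse each paired span into (book) or (quote) in one pass.
        collapsed = _PAIRED.sub(lambda m: f"({m.lastgroup})", sentence)
        # S21: decompose the sentence into tokens, then classify them.
        result.append([classify(t) for t in _TOKEN.findall(collapsed)])
    return result
```

Replacing every number, English word and bracketed span with a single label is what shrinks the word stock: however many distinct numbers appear in the text, they all count as one entry, `(num)`.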



Abstract

The invention discloses a method for automatically filtering stop words out of text. The method comprises the following steps: preprocessing the text, i.e. decomposing and generalizing it so as to compress the size of the word stock; searching for and filtering absolute stop words, which are words unassociated with the specific properties of a corpus; searching for and filtering relative stop words, which are expressed in natural language rather than as combinations of discrete keywords; and dynamically recognizing stop words, i.e. calculating the conditional probability that a word is a stop word based on the text size of the context associated with the word and on the position of the word, recognizing any word whose conditional probability is greater than a conditional probability threshold as a stop word, and filtering it out. Because stop words are judged by analyzing the text itself, and the stop words of each text are filtered separately, the accuracy of stop word filtering can be improved.
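
The absolute filtering and dynamic recognition steps of the abstract can be sketched as below. The abstract does not give the formula for the conditional probability, so this sketch takes it as a caller-supplied function. All names here (`filter_absolute`, `filter_dynamic`, `toy_prob`) are hypothetical, and `toy_prob` is a stand-in with made-up numbers that exists only to make the example runnable. The relative stop word step is omitted because the source does not describe how it is searched.

```python
from typing import Callable, Iterable, List, Set

# Assumed signature: (word, position in text, context size) -> probability
# that the word is a stop word. The real estimator is not given in the source.
ProbFn = Callable[[str, int, int], float]


def filter_absolute(tokens: Iterable[str], absolute: Set[str]) -> List[str]:
    """Drop tokens found in a fixed set of absolute stop words."""
    return [t for t in tokens if t not in absolute]


def filter_dynamic(tokens: List[str], prob_fn: ProbFn,
                   threshold: float) -> List[str]:
    """Drop tokens whose stop word probability is greater than the threshold."""
    size = len(tokens)  # context size, taken here as the token count
    return [t for i, t in enumerate(tokens)
            if prob_fn(t, i, size) <= threshold]


def toy_prob(word: str, position: int, context_size: int) -> float:
    """Placeholder estimator with made-up values; ignores position and size."""
    return {"的": 0.9, "了": 0.8}.get(word, 0.1)


tokens = ["(sym)", "我", "的", "书", "了"]
kept = filter_dynamic(filter_absolute(tokens, {"(sym)"}), toy_prob, 0.5)
print(kept)
```

Passing the probability function in as a parameter keeps the thresholding logic separate from the estimator, so an estimator that really uses word position and context size can be swapped in without changing the filter.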

Description

Technical field

[0001] The invention relates to the technical field of data retrieval, and in particular to a method for automatically filtering stop words.

Background

[0002] Keywords are the main basis for retrieval in data retrieval. A piece of text contains many words; some can serve as keywords, and some obviously cannot. Words that obviously cannot serve as keywords are stop words. Strictly speaking, stop words are "function words and non-retrieval words in computer retrieval". Filtering them out mainly saves storage space and improves efficiency when extracting keywords, and the approach is widely used in search engines and classification technologies. In actual operation, the algorithm automatically ignores certain characters or words, and these are called stop words.

[0003] Stop words are equivalent to filter words to a certain extent, but the scope of filter words is larger. Keywords containing sensitive information ...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 王宵栋, 张丽晓
Owner: SANDA UNIVERSITY