Method and device for extracting domain keywords

A technique for extracting domain keywords. It addresses the problems that existing approaches cannot extract keywords effectively, have difficulty producing results, and cannot effectively reflect the importance and distribution of keywords.

Active Publication Date: 2014-06-18
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] However, because TF-IDF is only a simple weighting scheme that tries to suppress noise, it cannot effectively reflect the importance of keywords or their distribution. Its accuracy is therefore not very high in many scenarios, and in many scenarios ...

Examples

Embodiment 1

[0021] Figure 1 is a flow diagram of a method for extracting domain keywords provided by Embodiment 1 of the present invention. This embodiment applies when a user enters search terms through a browser on a terminal to retrieve information, and the corresponding information website server extracts the domain keywords from domain texts in order to identify the domain to which the search term belongs. The method can be executed by a computer device with a domain keyword extraction function, such as an information website server. Referring to Figure 1, the method includes the following steps 101-103:

[0022] Step 101: generate a domain term frequency matrix composed of the term frequencies of the segmented words in each domain description text.

[0023] The information website server may first obtain the description texts of the various domains, either stored locally or obtained by crawling web pages. In this embodiment, the description text of each domain can be the text contained in t...
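Step 101 can be illustrated with a minimal Python sketch. It assumes the description texts have already been segmented into words, and it assumes a matrix orientation of one row per domain and one column per word; the excerpt does not state the orientation. The function name `build_term_frequency_matrix` and the toy data are hypothetical.

```python
from collections import Counter

def build_term_frequency_matrix(domain_tokens):
    """Build a domain-by-word count matrix.

    domain_tokens maps a domain name to the list of segmented words
    from that domain's description text (segmentation is assumed done).
    """
    domains = sorted(domain_tokens)
    vocab = sorted({word for tokens in domain_tokens.values() for word in tokens})
    counts = {domain: Counter(domain_tokens[domain]) for domain in domains}
    # Counter returns 0 for words that never occur in a domain.
    matrix = [[counts[domain][word] for word in vocab] for domain in domains]
    return domains, vocab, matrix

# Hypothetical toy input: two domains with pre-segmented words.
domains, vocab, matrix = build_term_frequency_matrix({
    "sports": ["the", "match", "the", "goal"],
    "finance": ["the", "stock", "the", "bond", "stock"],
})
```

A word such as "the" that is frequent in every domain produces a nearly constant column, which is the kind of shared background a later decomposition step could separate from domain-specific words.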

Embodiment 2

[0035] Figure 2 is a schematic flowchart of a method for extracting domain keywords provided by Embodiment 2 of the present invention. Building on the above embodiment, this embodiment further describes the step of decoupling the domain term frequency matrix into a low-rank background term frequency matrix and a sparse keyword term frequency matrix according to a set algorithm. Referring to Figure 2, the method includes steps 201-206:

[0036] Step 201: generate a domain term frequency matrix composed of the term frequencies of the segmented words in each domain description text.

[0037] Step 202: model the domain term frequency matrix as the sum of a low-rank first term frequency matrix and a sparse second term frequency matrix.

[0038] Step 203: construct an objective function that minimizes the difference between the domain term frequency matrix and that sum, where the constraint of the objective function is: the first word freque...
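The additive model of Steps 202-203 can be sketched in Python with NumPy. The constraint in Step 203 is truncated in this excerpt, so the sketch assumes one common choice: a rank bound on the low-rank matrix and a bound on the number of non-zero entries in the sparse matrix, fitted by simple alternating projection. This is an illustrative stand-in, not necessarily the patent's "set algorithm"; the function name, parameters, and toy matrix are hypothetical.

```python
import numpy as np

def low_rank_plus_sparse(D, rank=1, cardinality=3, iterations=50):
    """Alternately fit L (rank <= rank) and S (<= cardinality non-zeros) so D ~ L + S."""
    D = np.asarray(D, dtype=float)
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(iterations):
        # Low-rank step: best rank-bounded approximation of D - S via truncated SVD.
        U, sigma, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * sigma[:rank]) @ Vt[:rank]
        # Sparse step: keep only the largest-magnitude entries of the residual.
        residual = D - L
        S = np.zeros_like(D)
        if cardinality > 0:
            keep = np.argpartition(np.abs(residual), -cardinality, axis=None)[-cardinality:]
            S.flat[keep] = residual.flat[keep]
    return L, S

# Toy matrix (assumed orientation: rows are domains, columns are words).
# The first three columns behave like background words shared by all domains;
# each of the last three columns is frequent in only one domain.
D = np.array([
    [10, 8, 6, 9, 0, 0],
    [10, 8, 6, 0, 9, 0],
    [10, 8, 6, 0, 0, 9],
])
L, S = low_rank_plus_sparse(D, rank=1, cardinality=3)
```

Each step is an exact minimizer of the squared difference with the other matrix held fixed, so the difference between `D` and `L + S` never increases from one iteration to the next. The rank and cardinality bounds are tuning choices made for this sketch.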

Embodiment 3

[0063] Figure 3 is a schematic structural diagram of a device for extracting domain keywords provided in Embodiment 3 of the present invention. This embodiment applies when a user enters a search term through a browser on a terminal to retrieve information, and the corresponding information website server extracts the domain keywords from domain texts in order to identify the domain to which the search term belongs. The specific structure is as follows:

[0064] The domain term frequency matrix generation module 301 is configured to generate a domain term frequency matrix composed of the term frequencies of the segmented words in each domain description text;

[0065] The domain term frequency matrix decoupling module 302 is configured to decouple the domain term frequency matrix into the sum of a low-rank background term frequency matrix and a sparse keyword term frequency matrix according to a set algorithm;

[0066] The domain keyword extraction module 303 is configu...
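The description of module 303 is truncated in this excerpt. As a hedged illustration of the extraction step, the Python sketch below assumes that keywords for a domain are the words with the largest positive entries in that domain's row of the sparse keyword term frequency matrix. The selection rule, the function name `extract_domain_keywords`, the `top_n` parameter, and the sample values are all hypothetical.

```python
def extract_domain_keywords(sparse_matrix, domains, vocab, top_n=2):
    """Pick, per domain, the words with the largest positive sparse-matrix entries."""
    keywords = {}
    for domain, row in zip(domains, sparse_matrix):
        # Keep only words whose sparse-component value is positive for this domain.
        scored = [(word, value) for word, value in zip(vocab, row) if value > 0]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        keywords[domain] = [word for word, _ in scored[:top_n]]
    return keywords

# Hypothetical sparse matrix: rows are domains, columns follow vocab order.
sparse_keyword_matrix = [
    [0.0, 0.0, 0.0, 1.8, 0.0],
    [0.0, 0.9, 1.1, 0.0, 0.0],
]
result = extract_domain_keywords(
    sparse_keyword_matrix,
    domains=["finance", "sports"],
    vocab=["bond", "goal", "match", "stock", "the"],
)
```

Here a background word such as "the" has a zero entry in every row and is never selected, while each domain keeps only its own distinctive words.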

Abstract

The invention discloses a method and a device for extracting domain keywords. The method comprises the following steps: generating a domain term frequency matrix consisting of the term frequencies of the words in each domain description text; decoupling, according to a set algorithm, the domain term frequency matrix into the sum of a low-rank background term frequency matrix and a sparse keyword term frequency matrix; and extracting the keywords of the corresponding domains from the words in each domain description text according to the keyword term frequency matrix obtained by decoupling. With this technical scheme, domain keywords are extracted on the basis of how the frequencies of the words in each domain text are distributed across all domain texts, so that representative and discriminative keywords of the corresponding domain can be extracted accurately and effectively from each domain text.

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of the Internet, and in particular to a method and device for extracting domain keywords.

Background technique

[0002] At present, in some application scenarios, when a user enters a search term through a browser on a terminal device to retrieve information, the corresponding information website server first identifies the domain to which the search term belongs based on preset domain keywords, and then sends the large volume of text description content in that domain to the terminal device, so as to provide an information service for the user. Domain keywords are keywords that co-occur in multiple texts of a domain, are most representative of the domain, and are highly distinct from other domains. Domain keywords are widely used in automatic text classification, clustering, resource intelligence services, etc. Therefore, how to reasonably extract domain ...

Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 石磊
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD