Theme word extraction method, and method and device for obtaining related digital resource by using same

A theme word extraction technology for digital resources, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as interference from polysemous words and synonyms and poor robustness, achieving enhanced robustness and improved accuracy.

Active Publication Date: 2016-01-06
NEW FOUNDER HLDG DEV LLC +2

AI Technical Summary

Problems solved by technology

[0005] Therefore, the technical problem to be solved by the present invention is to overcome the defects of the prior art, in which subject word extraction suffers from interference by polysemous words and synonyms, requires manual editing of the feature words or of the candidate subject word list, and relies on named entity technology to determine candidate subject words. To this end, a method and device for extracting subject words are provided.
[0006] Another technical problem to be solved by the present invention is to overcome the disadvantages of the prior art, in which topic generation requires vector space models and named entity recognition and has poor robustness. To this end, a method and device for obtaining related digital resources are provided.



Examples


Embodiment 1

[0048] This embodiment provides a method for extracting subject words from digital resources. The digital resources may be a single file or multiple files. The digital resources are selected in advance, and keywords are then extracted from the selected digital resources. The flow chart of the method is shown in Figure 1 and includes the following steps:

[0049] S11. Segment the text of the digital resource.

[0050] After the digital resources are selected, the set of selected digital resources is denoted as D = {d_1, d_2, ..., d_m}, where d_i, i = 1, ..., m, represents the i-th news text, and m can be 1. The user dictionary is loaded and each news text is segmented individually. The user dictionary is a collection of words composed of idioms, abbreviations and new words. Its function is to add special terms from specific fields, such as idioms, abbreviations and new words, so as to improve the accuracy of word segmentation by the tokenizer. It is defined as userLib = {e_1 ...
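The effect of loading a user dictionary before segmentation can be illustrated with a minimal sketch. The source does not say which tokenizer is used, so forward maximum matching is swapped in here purely for illustration; the function name `segment`, the parameters `base_vocab` and `max_len`, and the sample words are all assumptions, not part of the patent.

```python
def segment(text, base_vocab, user_lib, max_len=6):
    """Forward maximum matching over base_vocab plus user_lib.

    Illustrative only: the tokenizer in the source is not specified.
    """
    vocab = set(base_vocab) | set(user_lib)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical data: a small base vocabulary and one new word in userLib.
base_vocab = {"方正", "发布", "数字"}
user_lib = {"数字报"}  # "digital newspaper", treated here as a new word

without_user_lib = segment("方正发布数字报", base_vocab, set())
with_user_lib = segment("方正发布数字报", base_vocab, user_lib)
```

In this sketch the new word is split into two tokens when only the base vocabulary is available, and kept whole once the user dictionary is loaded, which is the accuracy gain the paragraph describes.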

Embodiment 2

[0072] This embodiment provides a method for obtaining related digital resources, which is used to find, among massive digital resources, those related to a selected digital resource. First, a first digital resource is selected. The first digital resource can be a single article or multiple digital resources belonging to one topic. The purpose of this embodiment is to find other digital resources related to the first digital resource. The flow chart of the method is shown in Figure 2 and includes the following steps:

[0073] S21. Use the method in Embodiment 1 to extract the subject words of the first digital resource. After the first digital resource is selected, its subject words are extracted using the method in Embodiment 1, which is not repeated here. This yields the subject word vector of the first digital resource, topicWords = (tterm_1, tterm_2, ..., tterm_q), where tterm ...

Embodiment 3

[0099] This embodiment provides a topic generation method. Based on the files of interest that the user has read, it obtains the files in the resource library that belong to the same topic as those files and pushes these topics to the user to improve the user experience. The flow of the topic generation method is shown in Figure 3 and includes the following steps:

[0100] S31. Select a first digital resource. Here, digital resources that the user is interested in or concerned about can be selected, or some digital resources that the user has read. This step is used to select reference information, and the first digital resource is reference information for subsequent processing.

[0101] S32. Select one candidate digital resource in sequence as the second digital resource. A digital resource is selected from the candidate resource library as the second digital resource for subsequent processing.

[0102] S33. Use the method described in Embodime...


Abstract

The invention provides a theme word extraction method, and a method and a device for obtaining related digital resources by using the same. The theme word extraction method comprises: first, performing word segmentation on the text of a digital resource and obtaining content words from the segmentation result; for each theme, obtaining the probability distribution of the content words, the probability distribution comprising the content words and their corresponding weights; obtaining each meaning of the content words, combining the content words that share the same meaning and combining their corresponding weights; and determining the theme words according to the combined content words and their weights. The scheme works at the level of word meaning: words with the same meaning are combined, which prevents the interference of polysemous words and synonyms with theme word extraction found in the prior art and improves the accuracy of theme word extraction. The method eliminates the prior art's dependence on feature word selection and named entity recognition, weakens the interference of polysemous words and synonyms with theme word extraction, and realizes user-oriented, customized organization and generation of special topics.
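The combining step in the abstract can be sketched as follows. This is a simplified illustration, not the patented procedure: it assumes each content word maps to a single sense identifier through a hypothetical lookup table `sense_of`, and that weights are combined by summation, neither of which the abstract specifies. The names `merge_by_sense` and `top_theme_words` and the sample weights are likewise illustrative.

```python
from collections import defaultdict


def merge_by_sense(word_weights, sense_of):
    """Group content words by sense and add up their weights.

    Assumption: weights are combined by summation, and a word missing
    from sense_of is treated as its own sense.
    """
    merged = defaultdict(float)
    members = defaultdict(list)
    for word, weight in word_weights.items():
        sense = sense_of.get(word, word)
        merged[sense] += weight
        members[sense].append(word)
    return dict(merged), dict(members)


def top_theme_words(word_weights, sense_of, k=2):
    """Rank merged senses by combined weight and keep the top k."""
    merged, members = merge_by_sense(word_weights, sense_of)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [(sense, weight, members[sense]) for sense, weight in ranked[:k]]


# Hypothetical per-theme weights for four content words.
word_weights = {"computer": 0.25, "PC": 0.125, "election": 0.3, "vote": 0.1}
# Hypothetical sense table: "computer" and "PC" share one meaning.
sense_of = {"computer": "S_computer", "PC": "S_computer"}

merged, members = merge_by_sense(word_weights, sense_of)
top = top_theme_words(word_weights, sense_of, k=2)
```

Taken separately, "computer" and "PC" each rank below "election"; once merged, their combined weight ranks first, which shows how synonyms can otherwise dilute a theme.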

Description

Technical field

[0001] The invention relates to the field of digital resource processing, in particular to a method for extracting subject words, and a method and device for obtaining related digital resources.

Background technique

[0002] With the rapid development of the Internet, digital newspapers are becoming more and more popular, which greatly enhances the interaction between users and newspapers, and provides the possibility for the organization and generation of personalized newspapers and periodicals. In addition, a large number of new news reports are added every day across the country, most of which are new events and accompanied by a large number of new words. The so-called "new words" mainly refer to words with new content and new form, which do not exist in the original vocabulary system or have completely new meanings.

[0003] In order to better describe these digital resources and facilitate the subsequent recommendation and retrieval of related topics, i...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 许茜, 叶茂, 任彩红, 徐剑波, 汤帜
Owner: NEW FOUNDER HLDG DEV LLC