Author disambiguation method and device based on subject tree clustering

A subject-tree and author disambiguation technology, applied in text database clustering/classification, instrumentation, unstructured text data retrieval, etc. It addresses the problems of retraining on new data, large data volumes, and low applicability, with the effect of improving author-recognition accuracy and text retrieval quality.

Active Publication Date: 2020-06-02
BEIHANG UNIV
5 Cites, 4 Cited by

AI Technical Summary

Problems solved by technology

However, existing unsupervised learning algorithms have limited applicability: for example, they require large amounts of labeled data or must be retrained when new data arrives.
For electronic databases that are updated continuously, this means frequent retraining over large data volumes.



Examples


Embodiment 1

[0045] As shown in Figures 1 to 3, the author disambiguation method provided by the present invention includes the following steps:

[0046] 1. Obtain text data bearing an author name

[0047] The processor receives, from the input, the text data whose author needs to be disambiguated. In this embodiment of the invention, papers are used as the example of text data and are referred to as the papers to be classified.

[0048] A set Ak={A1,...,An} of same-name persons for a given name is stored in the memory, where K, k, and n are all natural numbers, and k∈K. Each element A1,...,An of Ak represents a distinct real-world author bearing that name; that is, n different authors share the given name (or name number) Ak.

[0049] Given a collection of papers to be classified P={P1,...,Pn}, the author list of each paper contains the given name Ak corresponding to the set A of same-name persons. That is,...
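The inputs described in [0048] and [0049] can be pictured with a small data model. This is only an illustrative sketch: the record layouts `Author` and `Paper`, the helper `papers_needing_disambiguation`, and the sample names are assumptions made for the example and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass
class Author:
    """One real person in a same-name set (hypothetical record layout)."""
    author_id: str
    name: str


@dataclass
class Paper:
    """One paper to be classified; author_name is the ambiguous name string."""
    paper_id: str
    title: str
    author_name: str


def papers_needing_disambiguation(papers, same_name_sets):
    """Return the papers whose author name maps to more than one real author."""
    return [p for p in papers if len(same_name_sets.get(p.author_name, [])) > 1]


# Same-name sets keyed by name: two distinct people share "Jing Zhang".
same_name_sets = {
    "Jing Zhang": [Author("a1", "Jing Zhang"), Author("a2", "Jing Zhang")],
    "Hui Li": [Author("a3", "Hui Li")],
}
papers = [
    Paper("p1", "Subject tree clustering", "Jing Zhang"),
    Paper("p2", "Semantic retrieval", "Hui Li"),
]
ambiguous = papers_needing_disambiguation(papers, same_name_sets)
```

Here only `p1` needs disambiguation, because its author name is shared by two entries of the same-name set.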

Embodiment 2

[0102] On the basis of the first embodiment, the following steps can be added to handle the case in which there is no author with the same name under a given subject.

[0103] 6. Determine whether an author with the same name exists under the subject node. If not, go to the next step; if so, determine that this same-name author is the author of the text data.

[0104] 7. Match the candidate same-name authors against each subject node in turn and calculate the matching degree.

[0105] Each candidate author in the same-name set Ak is matched against the subjects one by one, and the matching degree is calculated.

[0106] 8. Select a candidate author who is in the same discipline as the text data as its author. If no candidate author belongs to that subject, conclude that there is no existing author with the same name, and connect the author of the text data to the subject.

[0107] The numbering of steps 6 to 8 above is just to ...
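Steps 6 to 8 can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name, the dictionary layouts, the generated identifier format, and the rule that a candidate's discipline is the subject node with its highest matching degree are all hypothetical. The matching degrees are supplied as ready-made numbers, since the text here does not define how they are computed.

```python
def resolve_without_same_name_author(name, paper_subject, subject_authors, candidate_scores):
    """Sketch of steps 6-8; returns (author_id, created_new_author).

    subject_authors: {subject: {name: author_id}}, authors already attached
        to each subject node (hypothetical layout).
    candidate_scores: {author_id: {subject: matching_degree}} for candidates
        sharing `name` (scores are assumed inputs).
    """
    attached = subject_authors.setdefault(paper_subject, {})

    # Step 6: an author with this name already sits under the subject node.
    if name in attached:
        return attached[name], False

    # Step 7: take each candidate's discipline to be its best-matching subject.
    for author_id, scores in candidate_scores.items():
        if not scores:
            continue
        best_subject = max(scores, key=scores.get)
        # Step 8: choose a candidate whose discipline is the paper's subject.
        if best_subject == paper_subject:
            attached[name] = author_id
            return author_id, False

    # Step 8, fallback: no candidate in this discipline, so connect a new author.
    new_id = f"{name}#{paper_subject}"  # assumed identifier format
    attached[name] = new_id
    return new_id, True


subject_authors = {"computer science": {}}
candidate_scores = {
    "zhang-01": {"computer science": 0.2, "chemistry": 0.9},
    "zhang-02": {"computer science": 0.8, "chemistry": 0.1},
}
first = resolve_without_same_name_author(
    "Jing Zhang", "computer science", subject_authors, candidate_scores)
second = resolve_without_same_name_author(
    "Jing Zhang", "physics", subject_authors, candidate_scores)
```

In this example the first call selects `zhang-02`, whose best-matching subject is the paper's subject, while the second call finds no candidate in physics and attaches a new author to that subject node.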

Embodiment 3

[0109] On the basis of the first embodiment, the following steps can be added to handle the case in which at least two authors with the same name exist under a given subject (that is, the same-name authors are different people whose research directions are the same and fall under the same subject).

[0110] 9. Determine whether exactly one author with the same name exists under the subject node. If not, go to the next step; if so, determine that this same-name author is the author of the text data.

[0111] 10. Determine whether no author with the same name exists in the subject. If so, connect the author of the text data to the subject and add that author to the subject's candidate authors; if not (indicating that the subject contains multiple authors with the same name), go to the next step.

[0112] 11. Select the author of the paper with the highest matching degree in step 5 as the author of thi...
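The branching in steps 9 to 11 can be sketched as follows. This is an illustration only: the matching degree of step 5 is not reproduced in this section, so a Jaccard overlap between keyword sets is substituted as a stand-in, and the function names and the `{author_id: keywords}` layout are hypothetical.

```python
def jaccard(a, b):
    """Assumed stand-in for the matching degree: overlap of two keyword sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve_among_same_name_authors(paper_keywords, same_name_authors, new_author_id):
    """Sketch of steps 9-11 for one subject node.

    same_name_authors: {author_id: keywords} for the authors bearing the name
        under the chosen subject node (hypothetical layout; mutated in step 10).
    """
    # Step 9: exactly one same-name author, so that author wrote the paper.
    if len(same_name_authors) == 1:
        return next(iter(same_name_authors))

    # Step 10: no same-name author, so attach a new one to the subject.
    if not same_name_authors:
        same_name_authors[new_author_id] = set(paper_keywords)
        return new_author_id

    # Step 11: several same-name authors, so pick the best-matching one.
    return max(
        same_name_authors,
        key=lambda author_id: jaccard(paper_keywords, same_name_authors[author_id]),
    )


paper = {"subject", "tree", "clustering"}
single = resolve_among_same_name_authors(paper, {"a1": {"retrieval"}}, "new-1")
empty_node = {}
created = resolve_among_same_name_authors(paper, empty_node, "new-1")
several = resolve_among_same_name_authors(
    paper,
    {"a1": {"clustering", "subject"}, "a2": {"protein", "folding"}},
    "new-1",
)
```

With several same-name authors, the sketch returns `a1`, whose keywords overlap the paper's most under the assumed measure.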


Abstract

The invention discloses an author disambiguation method and device based on subject tree clustering. The method comprises the following steps: obtaining text data with authors; processing the text data to extract key information; extracting representative words from the text data; matching the text data against each subject node of the subject tree in turn, based on the representative words and the key information of the text data, and calculating a matching degree; and selecting the subject node with the highest matching degree, connecting that subject node with the text data, and taking the author with the same name under that subject node as the author. By constructing the subject tree and performing clustering calculation on it, the method eliminates text data classification errors caused by different persons sharing the same name, improves the accuracy of author recognition for text data, thereby improves text retrieval quality, and provides an effective auxiliary analysis means for computer semantic analysis.
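The node-selection step of the abstract, matching the text against every subject node and keeping the best one, can be sketched as a tree traversal. This is a minimal sketch under stated assumptions: the `SubjectNode` layout, the helper names, and the matching degree (the fraction of the paper's representative words found among a node's keywords) are hypothetical choices for illustration, not the measure defined by the patent.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectNode:
    """One node of the subject tree (hypothetical layout)."""
    name: str
    keywords: set
    children: list = field(default_factory=list)
    authors: dict = field(default_factory=dict)  # author name -> author id


def iter_nodes(root):
    """Yield every node of the subject tree, depth first."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def matching_degree(representative_words, node):
    """Assumed measure: share of the paper's words found in the node keywords."""
    words = set(representative_words)
    return len(words & node.keywords) / len(words) if words else 0.0


def best_subject_node(root, representative_words):
    """Return the subject node with the highest matching degree."""
    return max(iter_nodes(root), key=lambda n: matching_degree(representative_words, n))


root = SubjectNode(
    "science",
    {"science"},
    children=[
        SubjectNode("computer science", {"clustering", "retrieval", "semantic"}),
        SubjectNode("chemistry", {"polymer", "catalyst"}),
    ],
)
words = ["clustering", "retrieval", "author"]
best = best_subject_node(root, words)
```

The paper would then be connected to `best`, and an author with the same name recorded in `best.authors` would be taken as its author.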

Description

Technical field

[0001] The invention relates to an author disambiguation method based on subject tree clustering and a corresponding author disambiguation device, belonging to the field of computer semantic analysis.

Background art

[0002] According to statistics from Google and Yahoo, name searches account for 5-10% of all search requests in common online search systems. However, existing search engines treat names as ordinary character strings and do not handle the large number of duplicate names that occur in name retrieval. For example, a query for Jing Zhang in DBLP returns 54 papers belonging to 25 different authors with the same name. The name ambiguity problem comprises two distinct sub-problems: one person with several names, and one name shared by several people.

[0003] Traditional statistics-based machine learning methods are generally divided into two categories: supervised learn...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06K9/62, G06F40/295, G06F40/216
CPC: G06F16/35, G06F18/23
Inventors: 张辉, 王德庆, 黄宏鸣, 郝瑞
Owner: BEIHANG UNIV