Method and device for author naming disambiguation and electronic equipment

An author name disambiguation technology, applied in metadata text retrieval, special data processing applications, unstructured text data retrieval, etc., that can solve problems such as slow speed, insufficient precision, and complex name disambiguation.

Status: Pending; Publication Date: 2020-11-13
Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

AI Technical Summary

Problems solved by technology

[0004] However, for a large data set containing hundreds of millions of records, such as AMiner (the AMiner data set itself has 130 million experts and 200 million papers, dozens of times the data volume of DBLP and PubMed), author name disambiguation is a more complex task, which must consider both the acc



Examples


Embodiment 1

[0040] As shown in Figure 1, an embodiment of the present invention provides a method for author name disambiguation, including:

[0041] S101: according to the relevant information of a paper, using a pre-trained classification model to determine the unique author of the paper from the academic data set;

[0042] S102: for a paper whose unique author cannot be determined, searching the academic data set using the relevant information of the paper to obtain a candidate paper set;

[0043] S103: clustering the papers in the candidate paper set to obtain multiple categories, performing reverse classification on the papers in the candidate paper set to determine their categories, and creating a unique author for the papers according to the categories.

[0044] The method provided in this embodiment integrates classification and clustering processing for disambiguation. Classification processing is used as the threshold of clustering processing, which effectively solves...
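
The flow of S101 to S103 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Paper` and `Author` records, the Jaccard coauthor overlap that stands in for the pre-trained classification model, the 0.3 threshold, the shared-coauthor clustering rule, and the reading of "reverse classification" as re-scoring a paper against the clusters just obtained.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Paper:  # hypothetical record
    pid: str
    author_name: str
    coauthors: frozenset


@dataclass
class Author:  # hypothetical record
    aid: str
    name: str
    coauthors: set = field(default_factory=set)
    papers: list = field(default_factory=list)


def jaccard(a, b):
    """Set overlap; a stand-in for the pre-trained classifier's score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(paper, authors, threshold=0.3):
    """S101: return the best same-name author, or None if no score passes."""
    same_name = [a for a in authors if a.name == paper.author_name]
    best = max(same_name, key=lambda a: jaccard(paper.coauthors, a.coauthors),
               default=None)
    if best is not None and jaccard(paper.coauthors, best.coauthors) >= threshold:
        return best
    return None


def retrieve_candidates(paper, pool):
    """S102: candidate set = papers in the pool with the same author name."""
    return [p for p in pool if p.author_name == paper.author_name]


def cluster(candidates):
    """S103, clustering: group papers that share at least one coauthor."""
    clusters = []
    for p in candidates:
        hits = [c for c in clusters if any(p.coauthors & q.coauthors for q in c)]
        merged = [p] + [q for c in hits for q in c]
        clusters = [c for c in clusters if all(c is not h for h in hits)]
        clusters.append(merged)
    return clusters


def reverse_classify(paper, clusters):
    """S103, reverse classification (assumed meaning): score paper vs. clusters."""
    def key(c):
        pooled = set().union(*(q.coauthors for q in c))
        # Ties fall back to the cluster that already contains the paper.
        return (jaccard(paper.coauthors, pooled), any(q is paper for q in c))
    return max(clusters, key=key)


def disambiguate(papers, authors):
    """Run S101, then S102 and S103 for papers the classifier could not place."""
    assignment, unassigned = {}, []
    for p in papers:
        match = classify(p, authors)
        if match is None:
            unassigned.append(p)
        else:
            match.papers.append(p.pid)
            assignment[p.pid] = match.aid
    for p in unassigned:
        if p.pid in assignment:  # already placed with an earlier cluster
            continue
        pool = [q for q in unassigned if q.pid not in assignment]
        home = reverse_classify(p, cluster(retrieve_candidates(p, pool)))
        new = Author(f"new-{len(authors)}", p.author_name)
        for q in home:
            new.coauthors |= q.coauthors
            new.papers.append(q.pid)
            assignment[q.pid] = new.aid
        authors.append(new)
    return assignment
```

In this sketch the classifier acts as a gate: only papers it cannot place reach the more expensive retrieval and clustering steps, which is one way to read the statement that classification serves as the threshold of clustering.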

Embodiment 2

[0075] As shown in Figure 2, another aspect of the present invention includes a functional module architecture corresponding exactly to the aforementioned method flow; that is, an embodiment of the present invention also provides a device for author name disambiguation, including:

[0076] The unique author determination module 201 is used to determine the unique author of the paper from the academic data set by using the pre-trained classification model according to the relevant information of the paper;

[0077] The candidate paper set acquisition module 202 is used, for papers whose unique author cannot be determined, to search the academic data set using the relevant information of the papers to obtain a candidate paper set;

[0078] The unique author creation module 203 is used to cluster the papers in the candidate paper set to obtain multiple categories, and perform reverse classification on the papers in the candidate paper set to deter...
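
The three-module device can be sketched as a small composition of Python classes. This is only an illustrative layout under stated assumptions: the class names mirror modules 201 to 203, but the dictionary-based `Paper` record, the injected callables, and the `new-author-N` identifier scheme are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Paper = dict  # hypothetical record, e.g. {"id": "p1", "name": "J. Smith"}


@dataclass
class UniqueAuthorDeterminationModule:
    """Module 201: wraps the pre-trained classifier (here, any callable)."""
    classifier: Callable[[Paper], Optional[str]]

    def run(self, paper: Paper) -> Optional[str]:
        return self.classifier(paper)


@dataclass
class CandidatePaperSetAcquisitionModule:
    """Module 202: searches the academic data set for candidate papers."""
    search: Callable[[Paper], List[Paper]]

    def run(self, paper: Paper) -> List[Paper]:
        return self.search(paper)


@dataclass
class UniqueAuthorCreationModule:
    """Module 203: clusters candidates, reverse-classifies, creates an author."""
    cluster: Callable[[List[Paper]], List[List[Paper]]]
    reverse_classify: Callable[[Paper, List[List[Paper]]], List[Paper]]
    created: int = 0

    def run(self, paper: Paper, candidates: List[Paper]) -> str:
        category = self.reverse_classify(paper, self.cluster(candidates))
        self.created += 1
        author_id = f"new-author-{self.created}"  # hypothetical id scheme
        for member in category:
            member["author_id"] = author_id
        return author_id


@dataclass
class DisambiguationDevice:
    """Chains modules 201, 202 and 203 in the order of the method flow."""
    m201: UniqueAuthorDeterminationModule
    m202: CandidatePaperSetAcquisitionModule
    m203: UniqueAuthorCreationModule

    def process(self, paper: Paper) -> str:
        author_id = self.m201.run(paper)
        if author_id is None:  # classifier could not decide: fall through
            author_id = self.m203.run(paper, self.m202.run(paper))
        paper["author_id"] = author_id
        return author_id
```

Injecting the classifier, search, and clustering functions keeps each module replaceable on its own, which matches the one-to-one correspondence between modules and method steps.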


Abstract

The invention discloses a method and a device for author name disambiguation, and electronic equipment. The method comprises: determining a unique author of a paper from an academic data set by using a pre-trained classification model according to related information of the paper; for papers for which a unique author cannot be determined, searching the academic data set by using related information of the papers to obtain a candidate paper set; and clustering the papers in the candidate paper set to obtain a plurality of categories, performing reverse classification on the papers in the candidate paper set to determine the categories to which the papers belong, and creating unique authors for the papers according to the categories. In practice, applying the method provided by the invention to name disambiguation on a big data set achieves an efficient and extensible result without losing recall or precision. The method therefore provides an effective solution for name disambiguation of very large data sets.

Description

Technical field

[0001] The invention relates to the technical field of electronic data processing, in particular to a method, device and electronic equipment for author name disambiguation.

Background art

[0002] Author name ambiguity is a frequently encountered problem in academic data sets such as digital libraries. The main cause is that different authors may publish papers under the same name, and the same author may publish papers under different names because of abbreviations, nicknames, and similar variations. Name disambiguation is the key to solving this problem. It has been regarded as a challenging problem in many applications, such as document management in digital libraries, scholarly search, and social network analysis.

[0003] Currently, name disambiguation is usually performed using either clustering or classification algorithms independently. For example, the Gradient Boosted Trees classification method proposed by Kunho Kim achieved ...


Application Information

IPC(8): G06F16/35, G06F16/38
CPC: G06F16/35, G06F16/38
Inventor: 宋健, 唐杰, 刘德兵, 高博, 仇瑜, 鄢兴雨, 陈波, 张惠聪
Owner: Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)