Name disambiguation method and system based on LightGBM classification and representation learning

A name disambiguation and binary classification technology, applied in the field of information technology, that addresses the limited range of features used by existing methods.

Active Publication Date: 2022-01-21
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI

AI Technical Summary

Problems solved by technology

When existing disambiguation methods capture the similarity between papers, the features they use are limited to semantics, co-authorship, author co-occurrence, and similar signals.



Embodiment Construction

[0040] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

[0041] The invention targets scientific literature data and proposes a disambiguation algorithm, based on supervised learning and representation learning, for the problem of different authors in the literature sharing the same name. The supervised learning part adopts a LightGBM (hereinafter referred to as LGB) binary classification model. Specifically, the meta-information of the papers in the training set and the association information between papers are extracted through feature engineering, and the LGB algorithm is used to train a binary classification model that determines whether any two papers belong to the same author. The representation learning part refers to the word2vec text semantic representation method and the meta-path-b...
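The pair-construction step described above can be illustrated with a minimal sketch. It is not the patent's feature set: the three features (co-author overlap, same venue, title-word Jaccard), the field names `title`, `coauthors`, and `venue`, and the helper names `pair_features` and `build_pairs` are all assumptions chosen for illustration. The resulting rows `X` and labels `y` are the kind of input a binary classifier such as LightGBM would be trained on; the training call itself is not shown here.

```python
import itertools
import random

def pair_features(p, q):
    """Hypothetical pairwise features for two papers (illustrative only)."""
    a, b = set(p["coauthors"]), set(q["coauthors"])
    tw_p = set(p["title"].lower().split())
    tw_q = set(q["title"].lower().split())
    union = tw_p | tw_q
    return [
        len(a & b),                                       # shared co-authors
        int(p["venue"] == q["venue"]),                    # same venue flag
        len(tw_p & tw_q) / len(union) if union else 0.0,  # title-word Jaccard
    ]

def build_pairs(papers, author_of, n_neg_per_pos=1, seed=0):
    """Build positive (same author) and sampled negative (different author) pairs."""
    rng = random.Random(seed)
    pos, neg = [], []
    for i, j in itertools.combinations(sorted(papers), 2):
        (pos if author_of[i] == author_of[j] else neg).append((i, j))
    # Down-sample negatives so the two classes stay roughly balanced.
    neg = rng.sample(neg, min(len(neg), n_neg_per_pos * len(pos)))
    X = [pair_features(papers[i], papers[j]) for i, j in pos + neg]
    y = [1] * len(pos) + [0] * len(neg)
    return X, y

# Toy labeled data: two real authors who share one ambiguous name.
papers = {
    "p1": {"title": "Graph embedding for author names",
           "coauthors": ["Li Wei", "Chen Yu"], "venue": "KDD"},
    "p2": {"title": "Author names in graph data",
           "coauthors": ["Chen Yu"], "venue": "KDD"},
    "p3": {"title": "Protein folding simulation",
           "coauthors": ["Zhao Min"], "venue": "Bioinformatics"},
    "p4": {"title": "Simulation of protein dynamics",
           "coauthors": ["Zhao Min", "Sun Hao"], "venue": "Bioinformatics"},
}
author_of = {"p1": "a1", "p2": "a1", "p3": "a2", "p4": "a2"}

X, y = build_pairs(papers, author_of)
```

Each row of `X` describes one pair of papers and each entry of `y` says whether the pair shares an author, so a trained model's predicted probability can be read as a same-author score for an unseen pair.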


Abstract

The invention provides a name disambiguation method and system based on LightGBM classification and representation learning, targeting scientific literature data and the problem of different authors sharing the same name. In the supervised learning part, feature engineering is used to extract meta-information features of the papers in a training set and association features between papers; a data set of positive and negative sample pairs is constructed through sampling and used as input to a LightGBM binary classification model, whose output is taken as the probability that two papers belong to the same author. The representation learning part refers to the word2vec text semantic representation method and a meta-path-based relation network representation method to capture the semantic information of papers and the relational characteristics between them. Finally, based on the outputs of the supervised model and the representation learning model, a hierarchical clustering algorithm partitions the set of papers to be disambiguated into clusters, thereby resolving same-name authors. The method achieves high scalability and stability without sacrificing precision or recall, can be fully parallelized, and improves execution efficiency.
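The final clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the average-linkage rule, the stopping `threshold`, the function name `cluster_papers`, and the toy scores in `sim` are all hypothetical. The scores stand in for pairwise same-author scores already derived from the supervised model and the representation learning model; how the two outputs are combined is not specified in the text above.

```python
def cluster_papers(ids, sim, threshold=0.5):
    """Average-linkage agglomerative clustering over pairwise similarity scores.

    `sim` maps frozenset({a, b}) to a score in [0, 1]; merging stops once the
    best average similarity between any two clusters falls below `threshold`.
    """
    clusters = [[i] for i in ids]

    def linkage(a, b):
        # Mean similarity over all cross-cluster paper pairs.
        return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy scores for four papers that carry the same author name.
ids = ["p1", "p2", "p3", "p4"]
sim = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p3", "p4")): 0.8,
    frozenset(("p1", "p3")): 0.2,
    frozenset(("p1", "p4")): 0.1,
    frozenset(("p2", "p3")): 0.1,
    frozenset(("p2", "p4")): 0.3,
}

result = cluster_papers(ids, sim)
print(result)  # [['p1', 'p2'], ['p3', 'p4']]
```

Each resulting cluster is read as the paper set of one real-world author. A threshold-based stop is used here because the number of distinct authors behind a name is normally unknown in advance.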

Description

Technical field

[0001] The invention belongs to the field of information technology, and in particular relates to a name disambiguation method and system based on LightGBM classification and representation learning.

Background

[0002] Name disambiguation is considered a key task in the field of scientific literature data. It is mainly used in literature data management and analysis, scholar retrieval, and the construction of scholar social networks. With the rapid growth in the volume of scientific literature in recent years, the number of scholars has also increased and scholars sharing the same name have become more and more common, which makes same-name disambiguation highly challenging. Many solutions for the name disambiguation task have been proposed in China and abroad. Because the data come from multiple sources and the application scenarios are complex, there is still room to optimize disambiguation methods.

[0003]...


Application Information

IPC(8): G06K9/62, G06F40/30, G06F40/289, G06F16/36, G06N20/00
CPC: G06F40/289, G06F40/30, G06F16/367, G06N20/00, G06F18/231, G06F18/22, G06F18/24323, G06F18/214
Inventor: 董昊, 宁致远, 杜一, 周园春
Owner: COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI