
Method and system for automatically extracting scientific and technical literature data based on text mining

An automatic extraction technology for scientific and technical literature data, applied in electrical digital data processing, natural language data processing, instruments, etc. It addresses the large time overhead and slow data accumulation of existing collection methods, improving accumulation efficiency with strong operability and practicality.

Pending Publication Date: 2021-12-10
UNIV OF SCI & TECH BEIJING

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a method and system for automatically extracting scientific and technological literature data based on text mining, to solve the technical problems of large time overhead and slow data accumulation in existing methods of collecting scientific and technological literature data.



Examples


Example 1

[0054] This embodiment provides a method for automatically extracting scientific and technological literature data based on text mining. By preprocessing the literature corpus and applying word segmentation, text classification, and named entity recognition, the method quickly identifies the target entities in the scientific and technological literature corpus. The semantic relationships between target entities then form entity relationships, which capture the events described in the sentences of the corpus, so that the key data in the scientific and technological literature can be extracted automatically.

[0055] The method for automatically extracting scientific and technical literature data in this embodiment can be implemented by an electronic device, which can be a terminal or a server. The execution flow of the method is shown in Figure 1 and includes the following steps:

[0056] S1, obtaining the file of the data to be extracted; wherein, th...

Example 2

[0094] This embodiment provides a method for automatically extracting scientific and technological literature data based on text mining. By preprocessing the literature corpus and applying word segmentation, text classification, and named entity recognition, the method quickly identifies the target entities in the scientific and technological literature corpus. The semantic relationships between target entities then form entity relationships, which capture the events described in the sentences of the corpus, so that the key data in the scientific and technological literature can be extracted automatically.

[0095] Next, the automatic extraction of superalloy literature data in the field of materials science is taken as an example to illustrate the process of the automatic extraction method of this embodiment. As shown in Figure 2, the process includes:

[0096] 1) Document acquisition: automatically deter...

Example 3

[0117] This embodiment provides a system for automatically extracting scientific and technological literature data based on text mining, including:

[0118] A document acquisition module, configured to acquire a file of data to be extracted; wherein, the format of the file is XML, HTML or plain text;

[0119] A text preprocessing module, configured to extract the plain text content from files in XML and HTML format, filter out the publication information and URL information in the plain text content, and use the filtered plain text content to form the text corpus;

[0120] A target text screening module, configured to select sentences containing preset information in the text corpus as target sentences, perform table recognition and table analysis on files in XML and HTML format, convert the table information into nested lists for representation, and select tables containing preset information as target tables;

[0121] An entity recognition...
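The text preprocessing and sentence screening steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the function names, the `PUBLICATION_MARKERS` list, the sentence-splitting rule, and the keyword-based notion of "preset information" are all assumptions introduced here, and only HTML-style input is handled.

```python
import re
from html.parser import HTMLParser

# Assumed patterns: the source does not say how URL or publication
# information is detected, so these are illustrative placeholders.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PUBLICATION_MARKERS = ("doi:", "copyright", "all rights reserved")


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_plain_text(markup):
    """Return the visible text of an HTML document as one string."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def build_corpus(plain_text):
    """Split text into sentences, drop publication info, strip URLs."""
    corpus = []
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", plain_text):
        lowered = sentence.lower()
        if any(marker in lowered for marker in PUBLICATION_MARKERS):
            continue
        cleaned = URL_PATTERN.sub("", sentence)
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        if cleaned:
            corpus.append(cleaned)
    return corpus


def select_target_sentences(corpus, keywords):
    """Keep sentences that mention at least one preset keyword."""
    lowered = [k.lower() for k in keywords]
    return [s for s in corpus if any(k in s.lower() for k in lowered)]
```

Calling `select_target_sentences(build_corpus(extract_plain_text(page)), ["solvus"])` on a page string would return only the sentences mentioning that keyword. A real system would likely need a more robust sentence splitter, since abbreviations and decimal points can confuse this rule.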



Abstract

The invention discloses a method and a system for automatically extracting scientific and technical literature data based on text mining. The method comprises the following steps of: obtaining a file (XML (Extensible Markup Language), HTML (Hypertext Markup Language) or plain text) of data to be extracted; extracting plain texts in the XML and HTML files and filtering out publication information and URL (Uniform Resource Locator) information in the plain texts to form a text corpus; screening sentences containing preset information in the text corpus as target sentences; carrying out table recognition and table analysis on the XML and HTML files, and screening a table containing preset information as a target table; performing named entity identification on the target sentence and the target table respectively, identifying target entities contained in the target sentence and the target table, and determining a relationship between the target entities; and splicing the mutually associated target entities in the same literature to form a complete structured data set. According to the scheme, the extraction precision is high, and the whole process is automatic and easy to implement.
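The table step in the abstract (table analysis, then screening for tables that contain preset information) might look like the sketch below. It assumes well-formed HTML-style `<table>`, `<tr>`, `<td>` and `<th>` tags without nesting, and it treats "preset information" as a keyword match. Those choices, and the names `parse_tables` and `select_target_tables`, are illustrative assumptions rather than details from the patent. Each table becomes a nested list of rows, each row a list of cell strings.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Converts each <table> into a nested list: table -> rows -> cells."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables(markup):
    """Return all tables in the markup as nested lists of cell strings."""
    parser = _TableParser()
    parser.feed(markup)
    parser.close()
    return parser.tables


def select_target_tables(tables, keywords):
    """Keep tables in which any cell mentions a preset keyword."""
    lowered = [k.lower() for k in keywords]
    targets = []
    for table in tables:
        cells = [cell.lower() for row in table for cell in row]
        if any(k in cell for k in lowered for cell in cells):
            targets.append(table)
    return targets
```

The nested-list form keeps header and data rows aligned by position, so a later entity recognition step could pair each value with its column header by index.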

Description

Technical field

[0001] The invention relates to the field of computer application technology, in particular to a method and system for automatically extracting scientific and technological literature data based on text mining.

Background technique

[0002] Artificial intelligence and machine learning techniques have been successfully applied in many fields of natural sciences, such as biology, medicine, chemistry and materials. A large amount of structured data is a prerequisite for the implementation of artificial intelligence and machine learning technologies. Usually, scientists collect data by manually reading published scientific literature, which is time-consuming and slow in data accumulation. Therefore, it is urgent to develop an automatic extraction technology of scientific and technological literature data to realize the automatic extraction of scientific and technological literature data, and provide a new method for the rapid accumulation and acquisition of scie...


Application Information

IPC(8): G06F40/295, G06F40/247, G06F40/242, G06F40/151, G06K9/00
CPC: G06F40/295, G06F40/247, G06F40/242, G06F40/151, Y02D10/00
Inventors: 宿彦京, 姜雪, 王伟仁, 田少晗, 谢建新
Owner: UNIV OF SCI & TECH BEIJING