
Mass data-based causal group extraction method and system, and computer readable storage medium

A causal group extraction method based on mass data, applied in reasoning methods, computer components, calculations, etc., that improves extraction accuracy, reduces noise data, and achieves high reliability.

Pending Publication Date: 2022-06-28
GUANGZHOU DATASTORY INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

The prior art does not specifically optimize or improve the accuracy and redundancy of the extracted causal relationships.



Examples


Embodiment 1

[0036] As shown in Figure 1, a first aspect of the present invention provides a method for extracting causal groups based on massive data, comprising the following steps:

[0037] S1: Obtain network texts and store them separately by time period;

[0038] It should be noted that the network text obtained in the embodiments of the present application is content publicly disclosed on the Internet. It is stored in the Hadoop distributed file system (HDFS).
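As a rough illustration of step S1, the sketch below groups timestamped texts into per-period buckets and derives a storage path for each bucket. The day-level granularity, the `dt=` directory naming, and the `/data/web_text` root are assumptions made for the example; the source only says that text is stored separately by time period.

```python
from collections import defaultdict
from datetime import datetime


def period_key(timestamp: datetime) -> str:
    """Return a day-level partition key such as '2022-06-28' (granularity is assumed)."""
    return timestamp.strftime("%Y-%m-%d")


def partition_by_period(records):
    """Group (timestamp, text) records into {period_key: [text, ...]}."""
    buckets = defaultdict(list)
    for timestamp, text in records:
        buckets[period_key(timestamp)].append(text)
    return dict(buckets)


def partition_path(root: str, timestamp: datetime) -> str:
    """Build a hypothetical directory path for one period, e.g. root/dt=2022-06-28."""
    return f"{root.rstrip('/')}/dt={period_key(timestamp)}"
```

Keeping one directory per period lets later steps read or reprocess a single time slice without scanning the whole corpus.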

[0039] S2: uniformly sample the acquired network text to obtain a sample set and pre-label the sample set;

[0040] The acquired network text is uniformly sampled to obtain a sample set. The sample set should not be too small; in a specific implementation, it can contain about 10,000 samples. The sample set is then pre-labeled: keyword and regular-expression matching is used to make a first-pass mark of whether each sample contains a causal relationship. For example, reg...
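A minimal sketch of step S2, assuming English text and a small hand-picked list of cue words. The source does not list its actual keywords or regular expressions, so `CAUSAL_PATTERN` and the helper names here are purely illustrative.

```python
import random
import re

# Hypothetical causal cue words; the patent's real keyword list is not given.
CAUSAL_PATTERN = re.compile(
    r"\b(because|due to|leads? to|led to|results? in|causes?|caused by|therefore)\b",
    re.IGNORECASE,
)


def uniform_sample(texts, k, seed=0):
    """Draw up to k texts uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(texts, min(k, len(texts)))


def pre_label(text):
    """First-pass label: True if the text contains any causal cue word."""
    return CAUSAL_PATTERN.search(text) is not None


def pre_label_samples(texts, k, seed=0):
    """Sample the corpus, then attach a first-pass causal flag to each sample."""
    return [(text, pre_label(text)) for text in uniform_sample(texts, k, seed)]
```

Such pre-labels are only a starting point: they miss implicit causality and flag some non-causal sentences, which is why the method follows them with a separate labeling step.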

Embodiment 2

[0059] A second aspect of the present invention provides a system for extracting causal groups based on massive data. The system includes a memory and a processor, wherein the memory stores a program implementing the method for extracting causal groups based on massive data. When the causal group extraction program is executed by the processor, the following steps are implemented:

[0060] S1: Obtain network texts and store them separately by time period;

[0061] It should be noted that the network text obtained in the embodiments of the present application is content publicly disclosed on the Internet. It is stored in the Hadoop distributed file system (HDFS).

[0062] S2: uniformly sample the acquired network text to obtain a sample set and pre-label the sample set;

[0063] The acquired network text is uniformly sampled to obtain a sample set. The sample set should not be too small; in a specific implementation, the number of samples can be about 10...

Embodiment 3

[0084] This embodiment illustrates the method of the present invention by processing specific triples. For example, in a specific embodiment, the above-mentioned BERT+CRF model is used to perform causal extraction on network text to obtain the following triples:

[0085] [("Cool down by 5 degrees today", 0.55, "It will rain tomorrow"), ("Cold wave is coming", 0.9, "Down jacket sales increase"), ("Wire short circuit", 0.7, "Fire broke out"), ("Double Eleven is Coming", 0.65, "The manufacturer's down jacket is out of stock"), ("Cable short circuit", 0.55, "Cause a fire"), ("Cold air goes south", 0.85, "Down jacket hot sale"), ("Temperature Dip", 0.55, "Down jacket in short supply"), ("Aging line", 0.9, "Fire hazard"), ("Down jacket in short supply", 0.55, "Temperature drop")]

[0086] Compute a semantic vector for each of the above triples, compute the cosine distance between the vectors, and use the cosine distance (cosine distance = 1 - cosine similarity) as the metric index to clus...
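The cosine distance defined above can be sketched in a few lines. Because the source text is cut off before it names the clustering algorithm, `group_by_distance` below is only a simple stand-in (a greedy, threshold-based grouping), not the patent's method. The vectors in the usage note are made-up toy values rather than real semantic embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity. Assumes non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_by_distance(vectors, threshold):
    """Greedy grouping (illustrative stand-in for the unspecified clustering step).

    Each vector joins the first group whose first member lies within
    `threshold` cosine distance; otherwise it starts a new group.
    Returns lists of indices into `vectors`.
    """
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine_distance(vectors[group[0]], vector) <= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups
```

With toy vectors `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]` and a threshold of 0.2, the first two indices fall into one group and the last two into another, mirroring how triples about the same topic would be expected to cluster together.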


Abstract

The invention discloses a causal group extraction method and system based on mass data, and a computer-readable storage medium. The method comprises the following steps: acquiring web text and storing it by time period; uniformly sampling the acquired web text to obtain a sample set, and pre-labeling the sample set; performing event labeling in BIO format and causal relationship labeling on the pre-labeled sample set; training a BERT+CRF model on the labeled data; performing causal extraction on the stored web text with the BERT+CRF model and forming triples in a preset format; clustering the triples with a clustering algorithm to obtain causal groups; and selecting and reducing the obtained causal groups, and storing the reduced causal groups. The method improves the accuracy of causality extraction, reduces noise data, redundant data, and isolated data in the extraction result, and has relatively high reliability.
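To make the BIO labeling and triple-forming steps concrete, the sketch below decodes a BIO-tagged token sequence into event spans and assembles a (cause, score, effect) triple in the shape shown in the embodiment. The tag names `B-CAUSE`, `I-CAUSE`, `B-EFFECT`, `I-EFFECT`, and the reading of the middle number as a model score, are assumptions; the source does not spell out its tag set or the preset triple format.

```python
def decode_spans(tokens, tags):
    """Collect contiguous B-/I- runs into (label, text) spans.

    An I- tag that does not continue the open span is treated like 'O'.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


def to_triple(spans, score):
    """Return (cause, score, effect), or None if either event is missing."""
    cause = next((text for label, text in spans if label == "CAUSE"), None)
    effect = next((text for label, text in spans if label == "EFFECT"), None)
    if cause is None or effect is None:
        return None
    return (cause, score, effect)
```

For instance, tagging the tokens of "Cold wave is coming , down jacket sales increase" with four CAUSE tags, one `O`, and four EFFECT tags, then calling `to_triple(spans, 0.9)`, produces a triple of the same form as the embodiment's examples.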

Description

Technical field

[0001] The invention belongs to the technical field of event graphs in artificial intelligence natural language processing, and more particularly relates to a method, system, and computer-readable storage medium for extracting causal event groups based on massive data.

Background

[0002] Traditional causality extraction schemes mainly consider the extraction of events that contain causal relationships and pay little attention to optimizing the accuracy and redundancy of the extracted causal relationships. Existing rule-based or statistical rule-based methods usually need causal relation words to discover causal relations, so they cannot discover hidden causal relations well. By contrast, the deep learning-based method adopted in this scheme uses the language model BERT, pre-trained on a large-scale corpus, so it can mine causality through semantic and contextual reasoning to a certain extent.

[0003] The prior art discloses a meth...


Application Information

IPC(8): G06N5/04, G06F40/30, G06K9/62
CPC: G06N5/04, G06F40/30, G06F18/22, G06F18/23213
Inventor: 杨俊波, 何宇轩, 牟昊, 李旭日, 徐亚波
Owner: GUANGZHOU DATASTORY INFORMATION TECH CO LTD