
Large text CRF and rule classifying method and system based on full text

A classification method and system applied in the field of text processing. It addresses the loss of high-level meaning, the large size of the text, and low classification accuracy, with the aim of improving overall classification accuracy and accommodating different readers' individual understanding of a text.

Active Publication Date: 2017-11-21
北京智通云联科技有限公司

AI Technical Summary

Problems solved by technology

However, if this meaningful part is split into an unordered bag of words, it loses its cohesive high-level meaning. If rules alone are used for classification, the accuracy is very low and cannot meet business needs.
[0005] For purely statistical classification methods, the large text size (for example, more than 300,000 words) means that any statistical method must analyze a large number of statistical features, and optimizing these features at big-data scale consumes substantial system resources. For example, with memory use exceeding 200 GB, the iterative computation of the classification model cannot be carried out effectively. The resulting model exceeds 5 GB and occupies a large amount of memory at runtime.
Therefore, although statistical methods have an advantage in accuracy, they are limited by computing resources and cannot work effectively and accurately.



Detailed Description of the Embodiments

[0034] The present invention will be further described in detail below with reference to the accompanying drawings, so that those skilled in the art can implement it with reference to the text of the description.

[0035] It should be understood that terms such as "having", "comprising" and "including" used herein do not exclude the presence or addition of one or more other elements or combinations thereof.

[0036] The present invention provides a full text-based large text CRF and rule classification method, which includes the following steps:

[0037] Split the input file into two parts, the title text and the body text, and save them separately;
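The splitting step in [0037] could be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the title is identified, so the sketch treats the first non-empty line as the title, and the function names `split_title_and_body` and `save_parts` and the output file naming are hypothetical.

```python
from pathlib import Path


def split_title_and_body(text: str) -> tuple[str, str]:
    """Split raw text into (title, body).

    Assumption (not from the source): the first non-empty line is the
    title and everything after it is the body.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip():
            title = line.strip()
            body = "\n".join(lines[i + 1:]).strip()
            return title, body
    return "", ""  # no non-empty line found


def save_parts(title: str, body: str, out_dir: Path, stem: str) -> tuple[Path, Path]:
    """Save the title and the body as two separate UTF-8 text files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    title_path = out_dir / f"{stem}.title.txt"  # hypothetical naming scheme
    body_path = out_dir / f"{stem}.body.txt"
    title_path.write_text(title, encoding="utf-8")
    body_path.write_text(body, encoding="utf-8")
    return title_path, body_path
```

Keeping the two parts in separate files lets the title and the body be passed to different classifiers independently.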

[0038] The title text is processed with a CRF text processing method to obtain the correspondence between the file name and the classification sub-version. Based on the file names stored in each classification directory, three-level word segmentation is performed, the segmentation results are classified and labeled, and ...
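As a rough illustration of how a title might be prepared for CRF labeling, the sketch below turns each token of a title into a feature dictionary, a form commonly accepted by linear-chain CRF toolkits. It is not the segmentation or labeling scheme of [0038], whose details are truncated in the source: whitespace tokenization, the feature set, and the names `token_features` and `title_to_features` are all assumptions.

```python
def token_features(tokens: list[str], i: int) -> dict:
    """Build a feature dict for token i using the token and its neighbors.

    The features chosen here are illustrative assumptions only.
    """
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def title_to_features(title: str) -> list[dict]:
    """Tokenize a title on whitespace (an assumption) and featurize each token."""
    tokens = title.split()
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of such dictionaries, paired with one label per token, is the usual training input for a linear-chain CRF; the neighbor features let the model use local word order, which a bag-of-words representation discards.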


Abstract

The invention provides a full-text-based CRF and rule classification method and system for large texts. The full text of a large text is classified by combining a conditional random field (CRF) with a rule-based classification method. A semantics-based statistical CRF classification method is applied to the title of the input text, and a rule-based bag-of-words classification method is applied to the body. Finally, the classification results are merged, deduplicated and sorted, with the CRF result as the primary result and the rule-based result as the auxiliary result, and a final classification that integrates the semantic level and the character level is output. The method resolves the tension between the high abstraction of the title and the entity complexity of the body, supports understanding the text from different perspectives, accommodates different readers' individual understanding of a large text, and achieves high overall accuracy in full-text classification.
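The body classification and the final merging step described in the abstract could be sketched roughly as below. This is a minimal illustration, not the patented procedure: the keyword rules in `RULES`, the `min_hits` threshold, the English regex tokenization, and the function names are all hypothetical, and the CRF labels are simply passed in as a list.

```python
import re
from collections import Counter

# Hypothetical keyword rules: label -> keywords that vote for it.
RULES = {
    "drilling": {"drill", "wellbore", "mud"},
    "geology": {"reservoir", "sandstone", "fault"},
}


def rule_classify(body: str, rules: dict = RULES, min_hits: int = 2) -> list[str]:
    """Rule-based bag-of-words classification of the body text.

    Counts keyword occurrences regardless of word order and returns the
    labels with at least `min_hits` hits, strongest first.
    """
    counts = Counter(re.findall(r"[a-z]+", body.lower()))
    scored = []
    for label, keywords in rules.items():
        hits = sum(counts[k] for k in keywords)
        if hits >= min_hits:
            scored.append((label, hits))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [label for label, _ in scored]


def merge_results(crf_labels: list[str], rule_labels: list[str]) -> list[str]:
    """Merge results with CRF labels first (primary) and rule labels after
    (auxiliary), dropping duplicates while preserving that order."""
    merged, seen = [], set()
    for label in list(crf_labels) + list(rule_labels):
        if label not in seen:
            seen.add(label)
            merged.append(label)
    return merged
```

For example, if the title classifier yields `["drilling"]` and the body rules yield `["geology", "drilling"]`, the merged result lists `drilling` first, because the CRF result takes priority, followed by `geology`.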

Description

Technical Field

[0001] The invention belongs to the field of text processing, and in particular relates to a full-text-based CRF and rule classification method and system for large texts.

Background Art

[0002] Natural language, especially written text, is the main carrier of human knowledge and wisdom. How to extract useful knowledge from texts and distill it into original insight is a central goal of the current Internet era and of the coming era of artificial intelligence. As a basic task of natural language processing, classification occupies a core position in knowledge mining from natural language.

[0003] Text on the Internet is generally short, so classifying and mining it does not suffer from low accuracy. However, for some industrial applications, such as the upstream R&D sector of the petroleum industry, the reference literature for R&D typically runs to more than 300 pages and is rich in both figures and text, which makes it diff...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventors: 谭培波, 史晓凌, 茹海燕
Owner: 北京智通云联科技有限公司