Proprietary domain corpus model construction method, computer equipment and storage medium
A method for constructing a corpus model for a proprietary domain, together with computer equipment and a storage medium that apply it. The method aims to reduce training costs and improve model accuracy.
Examples
Embodiment 1
[0033] This embodiment provides a method for constructing a proprietary domain corpus model, including the following steps:
[0034] Step 1. Corpus collection and preprocessing
[0035] In many industries, such as the financial industry, information disclosure requirements mean that a large number of publicly available PDF files can be found on the Internet, including but not limited to bond prospectuses, prospectuses, investment fund contracts, and equity pledge announcements.
[0036] This step parses these PDF files in bulk and extracts their text to obtain a sufficiently large plain-text corpus for unsupervised training. The specific parsing methods include:
[0037] (1) Preserve the continuity of the text content: split it by paragraph and ensure that the context within each paragraph remains coherent;
[0038] (2) Normalize traditional and simplified text content by converting all traditional characters into simplified characters;
[0039] (3) The title of the document is taken as a separate paragrap...
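Parsing rules (1) and (2) above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name split_paragraphs, the blank-line paragraph boundary, and the three-character TRAD_TO_SIMP table are all assumptions made for the example. A real pipeline would use a complete traditional-to-simplified conversion table or a dedicated conversion library.

```python
import re

# Toy mapping for illustration only (hypothetical, three characters).
# A real pipeline needs a complete traditional-to-simplified table.
TRAD_TO_SIMP = str.maketrans({"債": "债", "書": "书", "說": "说"})


def split_paragraphs(raw_text: str) -> list[str]:
    """Split extracted text into coherent paragraphs of simplified characters.

    Assumes paragraphs are separated by blank lines and that single line
    breaks inside a paragraph are layout artifacts from the PDF.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw_text):
        # Rejoin hard-wrapped lines; Chinese text needs no joining space.
        joined = "".join(line.strip() for line in block.splitlines())
        if joined:
            paragraphs.append(joined.translate(TRAD_TO_SIMP))
    return paragraphs
```

Calling split_paragraphs on the text extracted from one PDF returns a list of paragraph strings that can be written to the corpus one paragraph per line.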
Embodiment 2
[0055] This embodiment builds on Embodiment 1:
[0056] Suppose a media user uses a crawler to collect a large number of stock-related documents and wants a classification model to determine which stock each document concerns and whether its content is favorable or unfavorable.
[0057] Correspondingly, this embodiment provides a method for constructing a proprietary domain corpus model, as shown in Figure 1, including the following steps:
[0058] Step 1. Parse all files, extract the plain text from the PDFs, and clean and preprocess the extracted text;
[0059] Step 2. Use the TF-IDF statistical model to compute the term frequency and inverse document frequency over the corpus, in order to obtain words specific to the financial field or high-frequency words in specific texts;
[0060] Step 3. For each high-frequency word found in Step 2, locate the paragraph that contains it, and insert 2 copies of that paragraph at arbitrary positions in the text;
[0061] Step 4. Carry out the pre-trai...
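Steps 2 and 3 above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the patent's implementation: the function names top_tfidf_terms and augment_paragraphs are hypothetical, documents are assumed to be already tokenized (Chinese text would first need a word segmenter), TF-IDF is computed with the common tf = count / length and idf = log(N / df) variant, and "copy the paragraph 2 times" is read as inserting two extra copies at random positions.

```python
import math
import random
from collections import Counter


def top_tfidf_terms(docs: list[list[str]], k: int = 5) -> list[str]:
    """Return the k terms with the highest TF-IDF score in any document.

    docs is a list of tokenized documents (each a list of terms).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    best: dict[str, float] = {}
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), score)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


def augment_paragraphs(paragraphs: list[str], key_terms: list[str],
                       copies: int = 2, seed: int = 0) -> list[str]:
    """Insert extra copies of every paragraph that contains a key term.

    Each matching paragraph gets `copies` additional copies placed at
    random positions; the seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    augmented = list(paragraphs)
    for paragraph in paragraphs:
        if any(term in paragraph for term in key_terms):
            for _ in range(copies):
                augmented.insert(rng.randint(0, len(augmented)), paragraph)
    return augmented
```

The effect of the duplication is that paragraphs carrying domain-specific terms appear more often in the text that is later used for training, while the remaining paragraphs keep their original count.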