Method for automatically writing specific manuscript
A manuscript-writing technology, applied in the field of automatically writing specific manuscripts, which addresses the problems that prior methods cannot intelligently learn parameter features and cannot be applied directly to Chinese text
Active Publication Date: 2017-05-31
李鹏
Cites: 3 · Cited by: 3
AI-Extracted Technical Summary
Problems solved by technology
[0004] 1. Automatic writing of specific manuscripts in the prior art is based on English text and wiki encyclopedias; Chinese natural language processing differs in many details, so those methods cannot be applied directly;
[0005] 2. The classification method used in the automatic writing of specific ...
Abstract
The invention relates to a method for automatically writing a specific manuscript and belongs to the field of information processing. Text documents are clustered according to title similarity, and a vector space model (VSM) is built based on TF-IDF so that the text is converted into vectors; this scheme takes the meaning of words into account and makes Chinese manuscript writing more reasonable and accurate. Mutual information (MI) is used to reduce the dimensionality of the VSM and to select the effective features supplied to a machine learning classifier. The classifier learns intelligently and outputs a first-draft article; an ILP processor then integrates the sentences of the first draft, automatically removing repeated sentences across the whole article, and produces a higher-quality final article. When a Chinese manuscript is written, the semantics and meaning of words are taken into account, intelligent learning is achieved, and through sentence optimization and article integration the method is suitable for writing specific manuscripts in multiple fields.
Application Domain
Machine learning; Special data processing applications
Technology Topic
tf–idf; Degree of similarity
Examples
- Experimental program (1)
Example Embodiment
[0051] In order to make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be described in detail below. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other implementations obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
[0052] Glossary:
[0053] TF-IDF is the abbreviation of Term Frequency-Inverse Document Frequency. It is a weighting scheme commonly used in information retrieval and data mining. The main idea of TF-IDF is: if a word or phrase appears with high frequency (TF) in an article and rarely appears in other articles, that word or phrase is considered to have good category-discriminating ability.
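For reference only, a commonly used formulation of this weighting (the patent does not specify which variant it uses) is:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain t.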
[0054] VSM is the abbreviation of Vector Space Model.
[0055] MI is the abbreviation of Mutual Information. In probability theory and information theory, the mutual information of two random variables is a measure of their mutual dependence.
[0056] ILP is the abbreviation of Integer Linear Programming. A program in which all or some of the variables are restricted to integers is called an integer program; if the model is linear and the variables are restricted to integers, it is called an integer linear program.
[0057] As shown in Figure 1, the present invention provides a method for automatically writing a specific manuscript, including the following steps:
[0058] Step S1: Determine the domain of the specific manuscript to be written, and use a web crawler to crawl k webpages in that domain from the web, k > 2. Each webpage has n subtitles, n ≥ 2. For each webpage, extract the i-th subtitle and the body under the i-th subtitle and generate the i-th text document, where the i-th subtitle serves as the title of the i-th text document and the body under the i-th subtitle serves as the body of the i-th text document, i = 1, ..., n;
[0059] Step S1 shows that a text library for learning is obtained from the network. The present invention generates a text document from each subtitle of a crawled webpage and the body under that subtitle: the title of the text document corresponds to a subtitle in the webpage, and the body of the text document is the body under that subtitle. On the one hand, the text documents can be clustered by their titles; on the other hand, because the body of a text document corresponds one-to-one with its title, after clustering the bodies of the text documents in a cluster also belong together, so the problem of unrelated bodies within a cluster does not arise.
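As an illustration only, step S1 could be implemented roughly as in the following Python sketch; the URL list, the use of <h2> headings as subtitles, and the TextDocument structure are assumptions, not part of the patent.

import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass
class TextDocument:
    title: str   # the i-th subtitle of a webpage
    body: str    # the body text under that subtitle

def crawl_text_documents(urls):
    """Build one TextDocument per subtitle of every crawled webpage (step S1)."""
    docs = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Assumption: subtitles are <h2> headings; the body is the text of the
        # sibling elements up to the next <h2>.
        for h2 in soup.find_all("h2"):
            parts = []
            for sib in h2.find_next_siblings():
                if sib.name == "h2":
                    break
                parts.append(sib.get_text(" ", strip=True))
            body = " ".join(parts).strip()
            if body:
                docs.append(TextDocument(title=h2.get_text(strip=True), body=body))
    return docs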
[0060] Step S2: Set a title similarity threshold, compute the title similarity of every pair of text documents, and group the text documents into multiple clusters such that the title similarity of any two text documents in a cluster is greater than or equal to the title similarity threshold; each cluster uses the title that occurs most frequently within it as the name of the cluster;
[0061] Step S2 shows that the present invention clusters text documents by title similarity and filters them with the title similarity threshold, ensuring that the text documents forming a cluster are highly similar, so that effective features for machine learning can be obtained more accurately and are concentrated rather than scattered. In the present invention, the Levenshtein algorithm can be used to calculate the title similarity of any two text documents. To ensure that the text documents forming clusters are sufficiently similar, that the effective features for machine learning are obtained accurately, and that the effective features are concentrated, the similarity threshold in the present invention can be set to 0.5.
[0062] In addition, in step S2 the text documents that do not form clusters may interfere with the present invention. Such text documents can be deleted, removing the interference points and excluding text documents that are not relevant.
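A minimal sketch of step S2 follows, assuming Levenshtein similarity is defined as 1 minus the edit distance divided by the longer title length and using the python-Levenshtein package; the greedy single-pass clustering shown here is only an illustration, since the patent does not fix a particular clustering procedure.

from collections import Counter
import Levenshtein  # pip install python-Levenshtein

def title_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1] (assumed definition)."""
    if not a and not b:
        return 1.0
    return 1.0 - Levenshtein.distance(a, b) / max(len(a), len(b))

def cluster_by_title(docs, threshold=0.5):
    """Greedy clustering: a document joins the first cluster whose titles are all
    at least `threshold`-similar to its own title (step S2).
    `docs` are objects with a .title attribute, as in the step S1 sketch."""
    clusters = []
    for doc in docs:
        placed = False
        for cluster in clusters:
            if all(title_similarity(doc.title, d.title) >= threshold for d in cluster):
                cluster.append(doc)
                placed = True
                break
        if not placed:
            clusters.append([doc])
    # Drop single-document clusters (interference points, paragraph [0062]).
    clusters = [c for c in clusters if len(c) > 1]
    # Name each cluster after its most frequent title.
    names = [Counter(d.title for d in c).most_common(1)[0][0] for c in clusters]
    return clusters, names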
[0063] Step S3: Count the number of text documents in each cluster, sort the clusters by the number of text documents they contain in descending order, and take the names of the first m clusters as the subtitles of the specific manuscript to be written, where m = (n1 + n2 + … + nk)/k, k represents the number of webpages crawled from the web, and nk represents the number of subtitles of the k-th webpage;
[0064] Step S3 determines the subtitle framework of the specific manuscript to be written. To make the present invention write the most suitable article, the clusters are sorted by the number of documents they contain, and the names of the first m clusters are preferably used as the subtitles of the specific manuscript, so that the subtitles have the highest degree of match. This step also provides a preferred way to choose the number of subtitles m: it is set to the average number of subtitles of the webpages crawled from the web, so that the number of subtitles of the specific manuscript is close to that of prior articles and is therefore reasonable and appropriate. For example, if k = 3 crawled webpages have 4, 6 and 5 subtitles respectively, then m = (4 + 6 + 5)/3 = 5.
[0065] Step S4: Process the text documents in the first m clusters with the TF-IDF algorithm to obtain the feature words of the text documents in each cluster, build a vector space model (VSM) from all the text documents in those clusters, use MI to reduce the dimensionality of the VSM, and select the effective features supplied to the machine learning classifiers;
[0066] Step S4 converts the text into vectors, so that the present invention considers the meaning and semantics of words more comprehensively. This overcomes the shortcoming of the prior art, in which the acquired parameters are based on word counts, the number of digits in the text and the like while the influence of word meaning on classification is ignored, and it makes Chinese manuscript writing more reasonable and accurate. MI is then used to reduce the dimensionality of the VSM and to select the effective features supplied to the machine learning classifiers.
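For illustration, the TF-IDF vector space model of step S4 could be built with scikit-learn as sketched below; the use of jieba for Chinese word segmentation and of TfidfVectorizer are assumptions, since the patent does not name specific tools.

import jieba  # Chinese word segmentation (assumed, not specified in the patent)
from sklearn.feature_extraction.text import TfidfVectorizer

def build_vsm(cluster_docs):
    """Return the TF-IDF document-term matrix and feature words of one cluster.
    `cluster_docs` are objects with a .body attribute, as in the step S1 sketch."""
    corpus = [" ".join(jieba.cut(d.body)) for d in cluster_docs]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)      # rows: documents, columns: feature words
    return X, vectorizer.get_feature_names_out()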
[0067] For step S4, the present invention also provides a preferred procedure for reducing the dimensionality of the vector space model VSM by using MI:
[0068] I.
[0069] II.
[0070] III.
[0071] IV. F = P(t | ci)
[0072] C = P(ci | t)
[0073] In I–IV, fi(t) represents the total number of documents in cluster ci that contain feature t, the average number of documents containing feature t per cluster is also used, α represents the balance factor, F represents the probability that the word t appears in class ci, and C represents the probability that a paragraph containing feature t belongs to class ci.
[0074] With the above preferred solution, if the influence of word frequency is not considered there will be a tendency to blindly select low-frequency words. The present invention therefore also provides the following solution:
[0075] The method of using MI to reduce the dimensionality of the vector space model VSM further includes:
[0076] V. BMI = α · F · C · MI
[0077] BMI represents the mutual information criterion with balance-factor correction that is finally used for feature selection. Through this scheme the influence of word frequency is taken into account, and the tendency to blindly select low-frequency words is balanced out.
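The sketch below illustrates one possible reading of this balanced feature-selection score; the pointwise mutual information formula, the balance factor derived from fi(t) and the per-cluster average, and the top-k cut-off are all assumptions, because formulas I–III are not reproduced above.

import numpy as np

def balanced_mi_scores(X_binary, labels, cluster_id, eps=1e-12):
    """Score features of one cluster with a balance-factor-corrected MI (step S4, V).

    X_binary: (n_docs, n_features) 0/1 matrix, 1 if the document contains the feature.
    labels:   cluster id of each document.
    """
    in_c = (labels == cluster_id)
    n_clusters = len(np.unique(labels))

    f_i = X_binary[in_c].sum(axis=0)               # documents in cluster c_i containing t
    f_avg = X_binary.sum(axis=0) / n_clusters      # average per cluster (assumed definition)
    alpha = f_i / (f_avg + eps)                    # assumed form of the balance factor

    p_t = X_binary.mean(axis=0) + eps              # P(t)
    p_c = in_c.mean() + eps                        # P(c_i)
    p_tc = X_binary[in_c].mean(axis=0) * p_c + eps # P(t, c_i)

    F = p_tc / p_c                                 # P(t | c_i)
    C = p_tc / p_t                                 # P(c_i | t)
    MI = np.log(p_tc / (p_t * p_c))                # pointwise mutual information (assumed)
    return alpha * F * C * MI                      # BMI = α · F · C · MI

def select_effective_features(scores, k=500):
    """Keep the indices of the k highest-scoring features (assumed cut-off)."""
    return np.argsort(scores)[::-1][:k]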
[0078] Step S5: For the first m clusters processed in step S4, each cluster corresponds to a unique machine learning classifier. The text documents in each cluster are divided into two parts: one part of the text documents is annotated and used to train the machine learning classifier; the other part is used to test the trained machine learning classifier to obtain the corresponding error rate, and the machine learning classifier is adjusted according to the error rate;
[0079] Step S5 uses supervised machine learning classifiers for classification. Such classification learns parameter features more dynamically and therefore yields more effective and intelligent classification results. In this step, the different machine learning classifiers may use the same machine learning algorithm, for example one of SVM or Naive Bayes, or they may use different machine learning algorithms.
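As a sketch only, one classifier per cluster could be trained and evaluated as below; the use of scikit-learn's LinearSVC on the TF-IDF features, the 70/30 split and the error-rate computation are illustrative assumptions.

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_cluster_classifier(X, y, test_size=0.3, random_state=0):
    """Train one classifier for one cluster and report its error rate (step S5).

    X: feature matrix of the annotated text documents (e.g. the reduced VSM).
    y: annotated labels for the documents of this cluster.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    clf = LinearSVC()            # Naive Bayes would be an equally valid choice
    clf.fit(X_train, y_train)
    error_rate = 1.0 - clf.score(X_test, y_test)
    return clf, error_rate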
[0080] Step S6: Construct the query sentence of the specific manuscript to be written, crawl candidate content from the Internet according to the query sentence, classify the paragraphs of the crawled candidate content with the adjusted machine learning classifiers, and output the first-draft article;
[0081] Step S6 produces the first-draft article. To make the crawled candidate content more applicable, the present invention also provides the following improvement: in step S6, a length threshold is set for crawling the candidate content, and only candidate content whose length is greater than or equal to the length threshold is crawled. By setting this length threshold the candidate content is filtered during crawling; on the one hand, low-quality candidate content is largely filtered out, and on the other hand, crawling efficiency is improved. In the present invention, to make the threshold reasonable and obtain useful candidate content, the length threshold of the candidate content can be set to 15 characters.
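A minimal illustration of the length filter and the paragraph routing in step S6 follows; how candidate paragraphs are obtained from the query results and how each cluster's classifier signals a match are simplified assumptions.

import jieba  # assumed segmentation tool, as in the earlier sketch

MIN_LENGTH = 15  # length threshold of candidate content, in characters ([0081])

def filter_candidates(paragraphs, min_length=MIN_LENGTH):
    """Keep only candidate paragraphs whose length reaches the threshold (step S6)."""
    return [p for p in paragraphs if len(p.strip()) >= min_length]

def assemble_first_draft(candidates, subtitles, classifiers, vectorizers):
    """Route each sufficiently long candidate paragraph to a subtitle using the
    per-cluster classifiers, producing the first-draft article (step S6)."""
    draft = {name: [] for name in subtitles}
    for para in filter_candidates(candidates):
        x_text = " ".join(jieba.cut(para))
        for name, clf, vec in zip(subtitles, classifiers, vectorizers):
            x = vec.transform([x_text])
            if clf.predict(x)[0] == 1:   # assumption: label 1 means the paragraph fits
                draft[name].append(para)
    return draft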
[0082] Step S7: When the machine learning classifiers classify the candidate content and output the first-draft article, each machine learning classifier scores the paragraphs it outputs, and the score of a paragraph is used as the score of each sentence in that paragraph;
[0083] According to the scores of the sentences in the first-draft article, an objective function and a first constraint condition are established, and the sentences of the first-draft article are integrated to form the final article;
[0084] wherein,
[0085] the objective function is:
[0086] max Σ_i w_si · x_si
[0087] x_si ∈ {0, 1}
[0088] where x_si represents the sentence indicator variable (1 if the sentence is present in the final article, 0 if it is absent), w_si represents the score of the sentence, and s_i represents the sentence number.
[0089] The first constraint is:
[0090] x_si + x_sj ≤ 1 for any two sentences s_i and s_j that repeat each other,
[0091] where s_i and s_j represent the two sentences respectively.
[0092] The present invention obtains the first-draft article in step S6. Because the first draft is assembled from paragraphs of many different sources, those paragraphs may contain repeated sentences. In the prior-art automatic writing of specific manuscripts, if two paragraphs contain repeated sentences, one of the two paragraphs is deleted, which easily removes particularly suitable paragraphs and keeps only paragraphs with low similarity rather than the most appropriate ones; because the prior art handles repeated sentences at the paragraph level, the quality of the resulting manuscript is not high. To solve this problem and improve the quality of the specific manuscript, the present invention introduces step S7, which automatically filters out the repeated sentences across the entire article.
[0093] In step S7, the similarity between two sentences is handled by the first constraint condition; however, deleting too many sentences may leave a paragraph oversimplified. For this, the present invention also provides the following improvement: step S7 further includes a second constraint condition:
[0094] Σ_{si ∈ e} x_si ≥ t · N(e) for each selected paragraph e,
[0095] where N(e) represents the total number of sentences in the selected paragraph e, and t represents the minimum proportion of sentences that must be kept in each paragraph.
[0096] By setting the second constraint condition, the above improvement ensures that at least a certain proportion of the sentences in each paragraph is retained. Preferably, t can be set to 1/3 in the second constraint, so that at least one third of the sentences in each paragraph are kept.
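A hedged sketch of the sentence-integration ILP of step S7 follows, using the PuLP solver; the similarity test deciding which sentence pairs count as "repeated" (a simple word-overlap check here) is an assumption, since the patent only states that the first constraint handles repeated sentences.

import pulp

def integrate_sentences(paragraphs, scores, t=1/3, repeated=None):
    """Select sentences for the final article by integer linear programming (step S7).

    paragraphs: list of paragraphs, each a list of sentences.
    scores:     score of each paragraph, reused as the score of its sentences ([0082]).
    repeated:   function deciding whether two sentences repeat each other (assumption).
    """
    if repeated is None:
        # Assumed similarity test: large word overlap counts as repetition.
        def repeated(a, b):
            wa, wb = set(a.split()), set(b.split())
            return len(wa & wb) / max(1, min(len(wa), len(wb))) > 0.8

    prob = pulp.LpProblem("sentence_integration", pulp.LpMaximize)
    x = {}   # x[(p, i)] = 1 if sentence i of paragraph p is kept
    for p, sentences in enumerate(paragraphs):
        for i, _ in enumerate(sentences):
            x[(p, i)] = pulp.LpVariable(f"x_{p}_{i}", cat="Binary")

    # Objective: maximize the total score of the kept sentences.
    prob += pulp.lpSum(scores[p] * x[(p, i)] for (p, i) in x)

    # First constraint: of any two repeated sentences, keep at most one.
    keys = list(x)
    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):
            (pa, ia), (pb, ib) = keys[a], keys[b]
            if repeated(paragraphs[pa][ia], paragraphs[pb][ib]):
                prob += x[keys[a]] + x[keys[b]] <= 1

    # Second constraint: keep at least a proportion t of each paragraph.
    for p, sentences in enumerate(paragraphs):
        prob += pulp.lpSum(x[(p, i)] for i in range(len(sentences))) >= t * len(sentences)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[s for i, s in enumerate(sentences) if x[(p, i)].value() == 1]
            for p, sentences in enumerate(paragraphs)]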
[0097] The specific manuscripts of the present invention may be mobile phone evaluation articles, automobile evaluation articles, real estate promotion articles, and so on.
[0098] The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.