[0055] The method for automatically acquiring new words from Chinese webpages proposed by the present invention is described in detail below with reference to the drawings and embodiments:
[0056] In the method for automatically acquiring new words from Chinese webpages proposed by the present invention, an original database and a stop word database are first set up. The original database is initially empty and stores the data generated while the new word acquisition method runs. The stop word database is pre-stored with words that cannot occur according to Chinese language rules and with existing words that are to be deleted; it can be changed at any time as needed. A new word acquisition cycle is also set. The length of the cycle can be changed according to the needs of the application: a short cycle suits acquiring recent new words, a longer cycle can be used otherwise, and the value can be adjusted to specific conditions; it is generally set to 1-30 days.
[0057] The method is shown in Figure 1 and includes the following steps:
[0058] 1) When the new word acquisition cycle arrives, collect different types of web pages from the Internet, parse each web page to obtain its body text together with time information, and preprocess the body text to obtain sentence fragments. This specifically includes the following steps:
[0059] 11) Collect different types of web pages through shared web crawler programs or RSS acquisition software (for example, use a shared web crawler program to collect web pages of designated news websites and BBS, and use a shared RSS acquisition software to collect designated blog web pages);
[0060] 12) Use commonly used web page parsing software to extract the body content and its time information from each web page, obtaining the Chinese text of the page, and then save the Chinese text to the hard disk (using the shared file storage software Lemur);
[0061] The collection and analysis of the above-mentioned webpages can also use other software, as long as the software can complete the tasks of collecting and parsing webpages.
[0062] 13) Preprocess the Chinese text: remove webpage tags, replace identifiers, segment the text and remove non-Chinese characters. (Preprocessing is required because the extracted body content often still contains webpage tags, identifiers and other characters that interfere with new word recognition.) This specifically includes:
[0063] 131) Scan the entire text, and remove all found webpage tags (usually angle brackets appearing in pairs) and their contents from the text.
[0064] 132) Scan the text obtained in step 131) and replace each webpage identifier found with the corresponding character (identifiers commonly used in webpages include those standing for the space, "$", "&" and the double quotation mark, for example "&nbsp;", "&amp;" and "&quot;"; these are replaced with a space, "$", "&" and a double quotation mark respectively; other webpage identifiers can likewise be replaced with their corresponding symbols);
[0065] 133) Segment the text processed in step 132) into sentence fragments, using the punctuation marks or carriage returns and line feeds in the text as delimiters;
[0066] 134) Scan each sentence fragment after segmentation, keep the characters within the encoding range of Chinese characters, and delete all other characters. (Characters are displayed according to some encoding, and web pages mostly use Unicode. Because the characters shown on web pages are rather messy, some special characters that cannot be part of a new word would interfere with new word recognition. The encoding range of Chinese characters in Unicode is \u4e00-\u9fa5.)
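Steps 131) to 134) can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `preprocess`, the regular expressions, the set of delimiter punctuation marks, and the use of the standard library's `html.unescape` to replace webpage identifiers are all assumptions made for the example.

```python
import html
import re

# Assumed patterns; the source does not list the exact delimiters.
TAG_RE = re.compile(r"<[^>]*>")                      # angle-bracket tags
SPLIT_RE = re.compile(r"[，。！？；：、,.!?;:\r\n]+")   # punctuation, CR, LF
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fa5]")     # outside \u4e00-\u9fa5


def preprocess(text):
    """Return the Chinese-only sentence fragments of one page body."""
    text = TAG_RE.sub("", text)           # step 131: remove webpage tags
    text = html.unescape(text)            # step 132: replace identifiers
    text = text.replace("\xa0", " ")      # "&nbsp;" becomes a plain space
    pieces = SPLIT_RE.split(text)         # step 133: split into fragments
    fragments = [NON_CHINESE_RE.sub("", p) for p in pieces]  # step 134
    return [f for f in fragments if f]    # drop fragments left empty


if __name__ == "__main__":
    print(preprocess("<p>我爱中国！Hello&nbsp;世界，2024年</p>"))
```

Identifiers are replaced only after the tags have been removed, so an escaped angle bracket in the body text is not mistaken for a tag.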
[0067] 2) Perform n-gram segmentation on the preprocessed sentence fragments to generate word strings, count the word frequency (that is, the number of times the same word string appears), and store the word strings together with their time information in the original database. This specifically includes the following steps:
[0068] 21) Split each preprocessed sentence fragment with the n-gram method, taking every n adjacent Chinese characters in sequence to form a word string (for example, for the sentence "我爱中国" ("I love China") with n set to 2, the following three word strings are obtained: "我爱", "爱中" and "中国"; n can take the values 1, 2 and 3, or another value as needed, generally not more than 4);
[0069] 22) Scan all word strings obtained by n-gram segmentation, count the number of occurrences of each word string, and record it as the word frequency of that word string (for example, the number of occurrences of the word string "中国" among the word strings generated by n-gram segmentation);
[0070] 23) Store all the word strings, their word frequencies, and the time information of the text extracted in step 12) together in the original database. The original database of this embodiment has two tables: the document index table, which stores document information, and the word string table, which stores word strings and word frequencies per document. The table structure of the original database is as follows:
[0071] The structure of the document index table:
[0072] Field Name
[0073] The structure of the word string table:
[0074] Field Name
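The segmentation and counting of steps 21) and 22) can be illustrated with a short Python sketch. The function names `ngrams` and `count_word_strings` are hypothetical, and storage in the original database (step 23) is left out because the table fields are not reproduced here.

```python
from collections import Counter


def ngrams(fragment, n):
    """Step 21: every run of n adjacent characters in the fragment."""
    return [fragment[i:i + n] for i in range(len(fragment) - n + 1)]


def count_word_strings(fragments, sizes=(1, 2, 3)):
    """Step 22: word frequency of every word string over all fragments."""
    counts = Counter()
    for fragment in fragments:
        for n in sizes:
            counts.update(ngrams(fragment, n))
    return counts


if __name__ == "__main__":
    print(ngrams("我爱中国", 2))
    print(count_word_strings(["我爱中国", "中国"], sizes=(2,)))
```

A fragment shorter than n simply yields no word strings for that n.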
[0075] 3) Filter the word strings in the original database according to a preset word frequency threshold: word strings whose word frequency is greater than or equal to the threshold are retained, and the others are deleted from the original database (word string table). The word frequency threshold can be adjusted according to the situation and generally lies in the range 1-10; in this embodiment it is set to 1;
[0076] 4) Perform adjacent string comparison and parent-child string comparison on the word strings retained in step 3) and filter them again, then delete the word strings that match the stop word database, to obtain the primary new word strings;
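As a small illustration of the threshold filter in step 3), assuming the word frequencies are held in an in-memory mapping rather than in the word string table (the function name `filter_by_frequency` is hypothetical):

```python
def filter_by_frequency(freq, threshold=1):
    """Keep word strings whose frequency is at least the threshold."""
    return {s: c for s, c in freq.items() if c >= threshold}


if __name__ == "__main__":
    print(filter_by_frequency({"中国": 3, "爱中": 1}, threshold=2))
```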
[0077] This specifically includes the following steps. Definitions: two word strings of length n that share n-1 consecutive characters are called adjacent strings (that is, the two word strings differ only in the first character of one and the last character of the other, and the remaining characters are all the same; for example, the word strings "我爱" and "爱中" are adjacent strings, and "我爱中华人民共和" and "爱中华人民共和国" are adjacent strings).
[0078] If a longer string contains a shorter string, the longer string is called the parent string and the shorter string is called the substring (a substring consists of several consecutive characters of the parent string and is defined relative to it; for example, "爱中" is a substring of "爱中国");
[0079] 41) If two adjacent strings have the same word frequency, both word strings are deleted; if one has a higher word frequency than the other, the word string with the lower word frequency is deleted and the one with the higher word frequency is retained;
[0080] 42) Scan the word strings retained in step 41) and compare the word frequencies of each substring and parent string pair; if the two word frequencies are exactly the same, delete the substring and keep the parent string;
[0081] 43) Filter the word strings retained in step 42) against the stop word database to obtain the primary new word strings;
[0082] (The stop words in the stop word database are characters determined according to Chinese language rules that do not form a meaningful word when they appear at a particular position in a word string. Stop words are divided into front stop words, back stop words and generalized stop words. Front stop words generally appear at the end of a word and rarely at the beginning, such as "er, Zi, Ran, Yu, Bian, Mo, Men, Hu". Back stop words are the opposite and rarely appear at the end of a word, such as "Lao, Ah". Generalized stop words can be set, as needed, to existing words or to preset words that are to be deleted, such as the words in general or professional dictionaries. The front stop words, back stop words and generalized stop words together constitute the stop word database. The filtering method is: if the first character of a word string is a front stop word, delete the word string; if the last character of a word string is a back stop word, delete the word string; if a word string is a generalized stop word, delete the word string.)
[0083] The table structure of the stop word database of this embodiment:
[0084] Field Name
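The three filters of step 4) can be sketched as below. Several points are assumptions made for the example and are not stated in the text: word frequencies are held in a dictionary, all comparisons in a step are made against the frequencies as they stood at the start of that step (so the result does not depend on scan order), adjacency is only checked for word strings of length 2 or more, and the stop word sets in the usage example are illustrative placeholders rather than the contents of the stop word database.

```python
from collections import defaultdict


def filter_adjacent(freq):
    """Step 41: of two adjacent strings keep the more frequent one;
    delete both when their frequencies are equal."""
    by_prefix = defaultdict(list)
    for s in freq:
        if len(s) >= 2:                      # assumption: skip single chars
            by_prefix[s[:-1]].append(s)
    doomed = set()
    for a in freq:
        if len(a) < 2:
            continue
        # b is adjacent to a when a minus its first character
        # equals b minus its last character (same length n).
        for b in by_prefix.get(a[1:], []):
            if a == b:
                continue
            if freq[a] == freq[b]:
                doomed.update((a, b))
            elif freq[a] < freq[b]:
                doomed.add(a)
            else:
                doomed.add(b)
    return {s: c for s, c in freq.items() if s not in doomed}


def filter_substrings(freq):
    """Step 42: delete a substring whose frequency equals its parent's."""
    doomed = set()
    for parent, count in freq.items():
        n = len(parent)
        for size in range(1, n):             # proper substrings only
            for start in range(n - size + 1):
                sub = parent[start:start + size]
                if freq.get(sub) == count:
                    doomed.add(sub)
    return {s: c for s, c in freq.items() if s not in doomed}


def filter_stop_words(freq, front_stop, back_stop, general_stop):
    """Step 43: apply front, back and generalized stop words."""
    return {
        s: c for s, c in freq.items()
        if s[0] not in front_stop
        and s[-1] not in back_stop
        and s not in general_stop
    }


if __name__ == "__main__":
    freq = {"中国": 5, "爱中国": 5, "我爱": 2, "爱中": 3, "们好": 4, "中": 9}
    kept = filter_substrings(filter_adjacent(freq))
    # Placeholder stop word sets, chosen only for this example.
    print(filter_stop_words(kept, {"们"}, {"老"}, {"中"}))
```

In the usage example, "我爱" and "爱中" lose to a more frequent adjacent string, "中国" is absorbed by its parent "爱中国" because both occur 5 times, and the remaining two strings are removed by the placeholder stop words.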
[0085] 5) Perform time series analysis on the time information of the primary new word strings to obtain new words. This specifically includes the following steps:
[0086] 51) Set the start date s of the time series analysis, the basic time unit g, the number of basic time units n, and the time series analysis threshold δ. The basic time unit g generally ranges from 1 to 15 days, the number of basic time units n generally ranges from 5 to 30, and the time series analysis threshold δ generally ranges from 0 to 30. (In this embodiment, g is set to 2 days, n is set to 10, and δ is set to 5.)
[0087] 52) Read out all the primary new word strings of date s to form a word string set C. For each word string t in C, look up its word frequency on each of the g*n days starting from s, giving g*n matrix word frequency data; aggregate every g word frequency values as one group (this embodiment uses the arithmetic mean) to obtain 1*n matrix word frequency data a_1, a_2, ..., a_n;
[0088] 53) Set the evaluation function f(a_{i+1}, a_i). The evaluation function used in this embodiment is as follows:
[0089] \[ f(a_{i+1}, a_i) = \begin{cases} 1, & \text{if } a_{i+1} > a_i \\ 0, & \text{if } a_{i+1} = a_i \\ -1, & \text{if } a_{i+1} < a_i \end{cases} \]
[0090] 54) Calculate the sum S of the n evaluation function values. If S>δ, the primary new word string is determined to be a new word; otherwise the primary new word string is deleted.
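Steps 52) to 54) can be sketched as follows, using the embodiment's values g = 2, n = 10 and δ = 5 as defaults. The sketch takes the evaluation function to return 1 when the aggregated frequency rises, 0 when it is unchanged and -1 when it falls. It also assumes that the daily frequencies are given as a chronologically ordered list of g*n values and that the sum runs over the n-1 consecutive pairs of a_1, ..., a_n; the function names are hypothetical.

```python
def aggregate(daily, g, n):
    """Step 52: arithmetic mean of each group of g daily frequencies."""
    if len(daily) != g * n:
        raise ValueError("expected g * n daily frequencies")
    return [sum(daily[i * g:(i + 1) * g]) / g for i in range(n)]


def evaluate(next_value, value):
    """Step 53: +1 for a rise, 0 for no change, -1 for a fall."""
    if next_value > value:
        return 1
    if next_value < value:
        return -1
    return 0


def is_new_word(daily, g=2, n=10, delta=5):
    """Step 54: a candidate is a new word when the trend sum exceeds delta."""
    a = aggregate(daily, g, n)
    # Assumption: one evaluation per consecutive pair, n - 1 in total.
    s = sum(evaluate(a[i + 1], a[i]) for i in range(n - 1))
    return s > delta


if __name__ == "__main__":
    print(is_new_word(list(range(20))))   # steadily rising frequency
    print(is_new_word([3] * 20))          # flat frequency
```

A word string whose frequency rises in most periods accumulates a large S and is kept, while one with a flat or fluctuating frequency does not pass the threshold.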
[0091] The present invention also proposes another method, which builds on the above method and further includes the following:
[0092] A filter word database is set up and is initially empty;
[0093] Step 4) further includes deleting a word string if it is the same as a word in the current filter word database;
[0094] 6) The new word strings obtained in step 5) are divided by manual labeling into new words and filter word strings, and the filter word strings are added to the filter word database used in step 4). (The filter word database stores the word strings marked for filtering through human-computer interaction after each run of the method. These word strings are not new words but are difficult for the machine to recognize as such. The filter word database can be extended incrementally, which further improves the accuracy of the acquired new words.)
[0095] In the embodiment, the table structure of the filter word database is:
[0096] Field Name
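The incremental use of the filter word database can be illustrated with a small sketch in which the database is modeled as an in-memory set. The function names, the `labels` mapping that stands in for manual labeling, and the example word strings are all hypothetical.

```python
def drop_filter_words(candidates, filter_words):
    """Extra check in step 4: remove strings already marked as filter words."""
    return [w for w in candidates if w not in filter_words]


def apply_manual_labels(candidates, labels, filter_words):
    """Step 6: keep strings labeled as new words; remember the rest."""
    new_words = []
    for w in candidates:
        if labels[w]:                 # True means a human accepted the word
            new_words.append(w)
        else:
            filter_words.add(w)       # grows the filter word set in place
    return new_words


if __name__ == "__main__":
    filter_words = set()              # initially empty
    labels = {"给力": True, "的一": False}
    print(apply_manual_labels(["给力", "的一"], labels, filter_words))
    # In the next cycle the rejected string is removed automatically.
    print(drop_filter_words(["的一", "雷人"], filter_words))
```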