Method for automatically acquiring new words from Chinese webpages

A technique for acquiring new words from webpages, applied in the field of Internet data mining. It addresses problems such as low algorithm efficiency, leakage of user privacy and poor Chinese support, and improves accuracy and processing efficiency.

Active Publication Date: 2010-05-12
0 Cites 44 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0008] The first method: user data such as search engine query keywords and chat records are not easy to obtain, and improper use may leak user privacy;
[0009] The second method: every candidate new word has to be searched in a search engine, so the algorithm is inefficient and its applicability is poor;
[0010] The third method: it suffers from low timeliness and an incomplete search range of new...

Method used

6) The new word strings obtained in step 5) are divided by manual labeling into new words and filter word strings, and the filter word strings are added to the filter word database of step 4). (The filter word database stores the word strings that this method...


The invention relates to a method for automatically acquiring new words from Chinese webpages and belongs to the technical field of Internet data mining. The method comprises the following steps: collecting different types of webpages from the Internet, parsing out the body text of webpages that contain time information, preprocessing the text, performing n-gram segmentation on the resulting sentence fragments to generate word strings and count word frequencies, and storing the word strings, their word frequencies and their time information in an original database; filtering the word strings in the original database against a word frequency threshold and keeping those whose word frequency is greater than or equal to the threshold; filtering the kept word strings again by adjacent string comparison and parent-child string comparison, deleting the word strings that also appear in the stop word database, and performing time series analysis on the time information of the resulting primary new word strings to obtain new words. The method can also comprise a step of adding filter word strings obtained by manual labeling to a filter word database. The method has the advantages of a wide range of new word sources, a simple Chinese word segmentation method, high processing efficiency, and high accuracy and rigor in finding new words.

Application Domain

Special data processing applications

Technology Topic

Lexical frequency, Chinese word (+6 more)





Example Embodiment

[0055] The method for automatically acquiring new words from Chinese webpages proposed by the present invention is described in detail below with reference to the drawings and embodiments:
[0056] In the method proposed by the present invention, an original database and a stop word database are first set up. The original database is initially empty and stores the data generated while the new word acquisition method runs. The stop word database is pre-stored with words that cannot occur according to Chinese language rules (and can be changed at any time as needed) and with existing words that are to be deleted. A new word acquisition cycle is also set. Its length can be changed according to actual application needs: a short cycle if recent new words are wanted, otherwise a longer one, with adjustments for specific conditions. It is generally set to 1-30 days.
[0057] The method is shown in Figure 1 and includes the following steps:
[0058] 1) When the new word acquisition cycle arrives, collect different types of web pages from the Internet, parse the body text of the web page containing time information, and preprocess the body text to obtain sentence fragments; specifically including the following steps:
[0059] 11) Collect different types of web pages through shared web crawler programs or RSS acquisition software (for example, use a shared web crawler program to collect web pages of designated news websites and BBS, and use a shared RSS acquisition software to collect designated blog web pages);
[0060] 12) Use commonly used web page analysis software to extract the content of the text and the time information of the text from the web page to obtain the Chinese text on the web page, and then (using the shared file storage software Lemur) to save the Chinese text to the hard disk;
[0061] The collection and analysis of the above-mentioned webpages can also use other software, as long as the software can complete the tasks of collecting and parsing webpages.
[0062] 13) Preprocess the Chinese text: remove webpage tags, replace identifiers, segment the text and remove non-Chinese characters (preprocessing is required because the extracted webpage body often still contains uncleared webpage tags, identifiers and other characters that interfere with new word recognition); specifically including:
[0063] 131) Scan the entire text, and remove all found webpage tags (usually angle brackets appearing in pairs) and their contents from the text.
[0064] 132) Scan the text obtained in step 131), and replace each webpage identifier found with the corresponding character (identifiers commonly used in webpages include the HTML entities for the non-breaking space, "$", "&" and the double quotation mark, for example "&nbsp;", "&amp;" and "&quot;"; these are replaced with a space, "$", "&" and a double quotation mark respectively; other webpage identifiers can also be replaced with their corresponding symbols);
[0065] 133) Use the punctuation marks or carriage return and line feed in the text as a sign to segment the text, and divide the text processed in step 132) into sentence fragments;
[0066] 134) Scan each sentence fragment after segmentation, keep the characters within the encoding range of Chinese characters, and delete all other characters (characters are displayed according to a certain encoding, and Unicode is mostly used in webpages; because the characters on webpages are rather messy, some special characters that cannot be part of a new word would interfere with new word recognition. The encoding range of Chinese characters in Unicode is \u4e00-\u9fa5).
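Steps 131) to 134) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function name `preprocess`, the punctuation set used as sentence boundaries, and the use of the standard library's `html.unescape` to replace webpage identifiers are all choices made for this sketch.

```python
import html
import re

# Step 131: a tag is taken to be a pair of angle brackets and what lies between them.
TAG_RE = re.compile(r"<[^>]*>")
# Step 133: sentence boundaries (this punctuation set is an assumption).
BOUNDARY_RE = re.compile(r"[，。！？；：、,.!?;:\r\n]+")
# Step 134: everything outside the range \u4e00-\u9fa5 is deleted.
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fa5]")


def preprocess(text: str) -> list[str]:
    """Turn raw webpage body text into Chinese-only sentence fragments."""
    text = TAG_RE.sub("", text)                         # step 131
    text = html.unescape(text).replace("\u00a0", " ")   # step 132
    fragments = []
    for piece in BOUNDARY_RE.split(text):               # step 133
        chinese_only = NON_CHINESE_RE.sub("", piece)    # step 134
        if chinese_only:
            fragments.append(chinese_only)
    return fragments
```

Splitting on punctuation before deleting non-Chinese characters keeps the order of steps 133) and 134): Latin letters inside a sentence are dropped without breaking the fragment in two.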
[0067] 2) Perform n-gram segmentation on the preprocessed sentence fragments to generate word strings and count the word frequency (that is, the number of times the same word string appears), and store the time information of the word string in the original database; specifically including the following steps:
[0068] 21) Use the n-gram method to divide each preprocessed sentence fragment, taking every n adjacent Chinese characters in sequence as one word string (for example, for the sentence "我爱中国" ("I love China") with n set to 2, the following three word strings are obtained: "我爱", "爱中", "中国"; n can take the values 1, 2 and 3, or other values as needed, generally not more than 4);
[0069] 22) Scan all word strings obtained by n-gram segmentation, count the number of occurrences of each word string, and record it as the word frequency of that word string (for example, the number of occurrences of the word string "中国" among the word strings generated by n-gram segmentation);
[0070] 23) Store all the segmented word strings, the counted word frequencies, and the time information of the text extracted in step 12) together in the original database. The original database of this embodiment has two tables: the document index table, which stores document information, and the word string table, which stores word strings and word frequencies per document. The table structure of the original database is as follows:
[0071] The structure of the document index table:
[0072] Field Name
[0073] The structure of the word string table:
[0074] Field Name
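The segmentation and counting of steps 21) and 22) can be sketched as below. The function names `ngrams` and `count_word_strings` are hypothetical, and the counts are held in an in-memory `Counter` instead of the database tables of step 23), whose field layout is not reproduced here.

```python
from collections import Counter


def ngrams(fragment: str, n: int) -> list[str]:
    """Slide a window of n characters over one sentence fragment (step 21)."""
    return [fragment[i:i + n] for i in range(len(fragment) - n + 1)]


def count_word_strings(fragments, sizes=(1, 2, 3)) -> Counter:
    """Count how often each word string occurs across all fragments (step 22)."""
    counts = Counter()
    for fragment in fragments:
        for n in sizes:
            counts.update(ngrams(fragment, n))
    return counts
```

For the fragment "我爱中国" and n = 2, `ngrams` returns the three word strings "我爱", "爱中" and "中国".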
[0075] 3) Filter the word strings in the original database against the preset word frequency threshold. Word strings whose word frequency is greater than or equal to the threshold are retained; all others are deleted from the original database (word string table). The word frequency threshold can be adjusted as the situation requires and generally lies in the range 1-10. In this embodiment, the word frequency threshold is set to 1;
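A minimal sketch of the threshold filter in step 3), assuming the word frequencies are available as a plain mapping from word string to count (the name `filter_by_frequency` is hypothetical):

```python
def filter_by_frequency(freq: dict, threshold: int = 1) -> dict:
    """Keep word strings whose frequency is >= threshold (step 3)."""
    return {s: f for s, f in freq.items() if f >= threshold}
```

With the embodiment's threshold of 1, every word string that occurs at least once is retained; raising the threshold removes rare strings before the more expensive comparisons of step 4).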
[0076] 4) Perform adjacent string comparison and parent-child string comparison on the word strings retained in step 3) and filter them again, then delete the word strings that also appear in the stop word database to obtain the primary new word strings;
[0077] It specifically includes the following steps. Definition: two word strings of length n that share n-1 consecutive characters are called adjacent strings (that is, apart from the first (or last) character of the first word string and the last (or first) character of the second word string, the remaining characters are all the same. For example, the word strings "我爱" and "爱中" are adjacent strings, as are the longer strings rendered in translation as "I love the Chinese People's Republic" and "Love the People's Republic of China"),
[0078] If a longer string contains another, shorter string, the longer string is called the parent string and the shorter string is called the substring (a substring consists of several consecutive characters of the parent string, and is defined relative to that parent string; for example, "爱中" is a substring of "爱中国");
[0079] 41) If two adjacent strings have the same word frequency, both word strings are deleted. If one word string has a higher word frequency than the other, the word string with the lower word frequency is deleted and the one with the higher word frequency is retained;
[0080] 42) Scan the word strings retained in step 41) and compare the word frequencies of each pair of substring and parent string; if the two word frequencies are exactly the same, delete the substring and keep the parent string;
[0081] 43) Filter the word strings retained in step 42) against the stop word database to obtain the primary new word strings;
[0082] (The stop words in the stop word database are characters determined according to Chinese language rules that do not form a meaningful word when they appear at a specific position in a word string. Stop words are divided into front stop words, back stop words and generalized stop words. Front stop words generally appear at the end of a word and rarely at the beginning, such as "er, Zi, Ran, Yu, Bian, Mo, Men, Hu". Back stop words are the opposite and rarely appear at the end of a word, such as "Lao, Ah". Generalized stop words can be set as needed to existing words or preset words to be deleted, such as words in general or professional dictionaries. The front stop words, back stop words and generalized stop words together constitute the stop word database. The filtering method is: if the first character of a word string is a front stop word, delete the word string; if the last character of a word string is a back stop word, delete the word string; if a word string is a generalized stop word, delete the word string.)
[0083] The table structure of the stop word database of this embodiment:
[0084] Field Name
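Steps 41) to 43) can be sketched as three small filters over a mapping from word string to word frequency. Several points are assumptions of this sketch and not statements of the source: adjacency is read as a one-position shifted overlap (as in "我爱" and "爱中"), all adjacent pairs are judged against the frequencies before any deletion, and the stop word sets are passed in as plain Python sets. All function names are hypothetical.

```python
def is_adjacent(a: str, b: str) -> bool:
    """Same length n > 1, and the n-1 character overlap is shifted by one position."""
    return len(a) == len(b) and len(a) > 1 and (a[1:] == b[:-1] or b[1:] == a[:-1])


def filter_adjacent(freq: dict) -> dict:
    """Step 41: equal frequency deletes both strings, otherwise the rarer one."""
    doomed = set()
    strings = list(freq)
    for i, a in enumerate(strings):
        for b in strings[i + 1:]:
            if not is_adjacent(a, b):
                continue
            if freq[a] == freq[b]:
                doomed.update((a, b))
            else:
                doomed.add(a if freq[a] < freq[b] else b)
    return {s: f for s, f in freq.items() if s not in doomed}


def filter_substrings(freq: dict) -> dict:
    """Step 42: drop a substring whose frequency equals that of a parent string."""
    doomed = {
        child
        for child in freq
        for parent in freq
        if len(child) < len(parent) and child in parent and freq[child] == freq[parent]
    }
    return {s: f for s, f in freq.items() if s not in doomed}


def filter_stop_words(freq: dict, front: set, back: set, general: set) -> dict:
    """Step 43: apply front, back and generalized stop words."""
    return {
        s: f
        for s, f in freq.items()
        if s[0] not in front and s[-1] not in back and s not in general
    }
```

The pairwise loops are quadratic in the number of word strings, which is acceptable for an illustration; a production version would likely index strings by their shared prefix or suffix.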
[0085] 5) Perform time series analysis on the time information of the primary new word strings to obtain new words; specifically including the following steps:
[0086] 51) Set the start date s of the time series analysis, the basic time unit g, the number of basic time units n, and the time series analysis threshold δ. The basic time unit g generally ranges from 1 to 15 days, the number of basic time units n generally ranges from 5 to 30, and the time series analysis threshold δ generally ranges from 0 to 30. (In this embodiment, g is set to 2 days, n is set to 10, and δ is set to 5.)
[0087] 52) Read out all the primary new word strings of date s to form a word string set C. For each word string t in C, look up its word frequency on each of the g*n days starting from s to obtain g*n word frequency values; every g consecutive values are aggregated as one group (in this embodiment, by taking the arithmetic mean), giving a 1*n matrix of word frequency data a_1, a_2, ..., a_n;
[0088] 53) Define the evaluation function f(a_{i+1}, a_i). The evaluation function set in this embodiment is as follows:
[0089] f(a_{i+1}, a_i) = \begin{cases} 1, & \text{if } a_{i+1} > a_i \\ 0, & \text{if } a_{i+1} = a_i \\ -1, & \text{if } a_{i+1} < a_i \end{cases}
[0090] 54) Calculate the sum S of the n evaluation function values. If S > δ, the primary new word string is determined to be a new word; otherwise the primary new word string is deleted.
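The time series test of steps 52) to 54) can be sketched as follows, using the embodiment's values g = 2 and δ = 5 as defaults. The function names are hypothetical. One point is an assumption of this sketch: the text speaks of n evaluation function values, but n aggregated values a_1, ..., a_n give only n-1 consecutive pairs, so the sum here runs over those n-1 pairs.

```python
def aggregate(daily: list, g: int) -> list:
    """Average every g consecutive daily word frequencies (step 52)."""
    if len(daily) % g != 0:
        raise ValueError("expected g * n daily values")
    return [sum(daily[i:i + g]) / g for i in range(0, len(daily), g)]


def trend(next_value: float, value: float) -> int:
    """Evaluation function f: 1 for a rise, 0 for no change, -1 for a fall (step 53)."""
    return (next_value > value) - (next_value < value)


def is_new_word(daily: list, g: int = 2, delta: float = 5) -> bool:
    """Sum f over consecutive aggregated values and compare with delta (step 54)."""
    a = aggregate(daily, g)
    s = sum(trend(a[i + 1], a[i]) for i in range(len(a) - 1))
    return s > delta
```

A word string whose frequency rises in most periods accumulates a large positive S and passes the threshold, while a long-established word with a flat or fluctuating frequency stays near zero and is deleted.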
[0091] The present invention also proposes another method, which may further include the following content based on the above method:
[0092] Set the filter word database to be initially empty;
[0093] Step 4) also includes deleting a word string if it is the same as a word in the current filter word database;
[0094] 6) The new word strings obtained in step 5) are divided by manual labeling into new words and filter word strings, and the filter word strings are added to the filter word database of step 4). (The filter word database stores the word strings that, after each run of this method, are marked for filtering through human-computer interaction. These word strings are not new words but are difficult for the machine to recognize. The filter word database can grow incrementally, which further improves the accuracy of new word acquisition.)
[0095] In the embodiment, the table structure of the filter word database is:
[0096] Field Name
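The feedback loop between step 6) and step 4) can be sketched as below, with the filter word database modeled as an in-memory set. The function names and the form of the manual labels (a mapping from word string to `True` for a genuine new word) are assumptions of this sketch.

```python
def apply_manual_labels(candidates: list, labels: dict, filter_words: set) -> list:
    """Step 6: keep strings labeled as new words, add the rest to the filter set."""
    new_words = []
    for word in candidates:
        if labels[word]:
            new_words.append(word)
        else:
            filter_words.add(word)  # grows incrementally across runs
    return new_words


def drop_filtered(freq: dict, filter_words: set) -> dict:
    """Extension of step 4: delete word strings already in the filter word set."""
    return {s: f for s, f in freq.items() if s not in filter_words}
```

Because the filter set persists between acquisition cycles, a string rejected once by a human reviewer is removed automatically in every later cycle.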

