The invention discloses a method for extracting information from an academic home page. The method comprises the following steps: (1) finding an academic home page on the Internet; (2) crawling and analyzing the academic home page, wherein the crawling of irrelevant pages is reduced by using a
heuristic strategy so as to accelerate the analysis; (3) parsing the page into Document Object Model (DOM) form, and segmenting it according to the attributes and contents of its elements so as to acquire a list of cohesive text units; (4) identifying the text units by using information recognizers, wherein each information recognizer identifies only one information type, and performing subfield extraction on the text information; (5) performing association analysis on the extraction result, resolving ambiguities by using the associations between pieces of information, and filling in missing fields; and (6) matching the extraction result against a
database, and eliminating redundant data, wherein the extraction result is stored in a semantic database in the form of semantic data. By combining heuristic rules, a machine learning method and a conditional probability model, the method can extract academic information from the academic home page efficiently and accurately.
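
Steps (3) and (4) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the block-level tag set, the regular expressions, and the names TextUnitSplitter, recognize_email, recognize_phone and extract are all assumptions introduced here for illustration, and simple pattern matching stands in for the machine learning and conditional probability models named above. The sketch splits a page into text units at block-level element boundaries and then applies recognizers that each identify exactly one information type.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit cohesive text units; the source does not list them.
BLOCK_TAGS = {"div", "p", "li", "td", "tr", "br", "h1", "h2", "h3", "h4"}


class TextUnitSplitter(HTMLParser):
    """Collects page text into units, closing a unit at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buffer = []

    def _flush(self):
        # Join buffered fragments and normalize whitespace before storing the unit.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


# Hypothetical patterns; each recognizer handles one information type only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def recognize_email(unit):
    match = EMAIL_RE.search(unit)
    return match.group(0) if match else None


def recognize_phone(unit):
    match = PHONE_RE.search(unit)
    return match.group(0) if match else None


RECOGNIZERS = {"email": recognize_email, "phone": recognize_phone}


def extract(html):
    """Returns (text units, list of (information type, value) pairs)."""
    splitter = TextUnitSplitter()
    splitter.feed(html)
    splitter.close()
    records = []
    for unit in splitter.units:
        for info_type, recognizer in RECOGNIZERS.items():
            value = recognizer(unit)
            if value is not None:
                records.append((info_type, value))
    return splitter.units, records


SAMPLE_HTML = (
    "<div><h1>Dr. Jane Doe</h1>"
    '<p>Email: <a href="mailto:jane.doe@example.edu">jane.doe@example.edu</a></p>'
    "<p>Phone: +1 555 010 9999</p></div>"
)
units, records = extract(SAMPLE_HTML)
print(units)
print(records)
```

Keeping one recognizer per information type means a new field (for example, a postal address) could be supported by registering one more function, without changing the segmentation step. The association analysis of step (5) and the database matching of step (6) are outside the scope of this sketch.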