The invention relates to a method and device for extracting webpage information over the HTTP protocol. The method comprises five steps: template generation, webpage address analysis, information extraction, information checking, and information storage. In the template generation step, a page analysis template is customized for the target page from which information is to be extracted, and the target fields and checking rules are predefined in the page analysis template. In the webpage address analysis step, the webpage address of the target page is analyzed to obtain the HTML source file of the target page. In the information extraction step, the HTML source file of the target page is read and parsed, and the page information matching the target fields predefined in the page analysis template is extracted from it. In the information checking step, the extracted page information is checked against the predefined checking rules to determine whether it meets the requirements. In the information storage step, the page information that has passed information checking is stored. With this method and device, page information on a network is effectively filtered, acquired, and collected through the open HTTP protocol, and templates are customized for different target pages, so that customized information extraction is achieved.
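
The five steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not specify how a template is represented, so the sketch assumes a template is a set of regular expressions (one per target field) plus one predicate per checking rule, and it assumes SQLite as the storage back end. All names (`PageTemplate`, `fetch_html`, `extract`, `check`, `store`, `process_page`, and the `page_info` table) are hypothetical and chosen only for illustration.

```python
import re
import sqlite3
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class PageTemplate:
    """Hypothetical page analysis template (template generation step)."""
    # Target field name -> regex with one capture group locating the value.
    fields: Dict[str, str]
    # Target field name -> predicate the extracted value must satisfy.
    rules: Dict[str, Callable[[str], bool]] = field(default_factory=dict)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Webpage address analysis: request the URL over HTTP, return the HTML source."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract(html: str, template: PageTemplate) -> Dict[str, Optional[str]]:
    """Information extraction: find the value for each predefined target field."""
    record: Dict[str, Optional[str]] = {}
    for name, pattern in template.fields.items():
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        record[name] = match.group(1).strip() if match else None
    return record


def check(record: Dict[str, Optional[str]], template: PageTemplate) -> bool:
    """Information checking: every field must be present and pass its rule."""
    for name, value in record.items():
        if value is None:
            return False
        rule = template.rules.get(name)
        if rule is not None and not rule(value):
            return False
    return True


def store(conn: sqlite3.Connection, url: str, record: Dict[str, Optional[str]]) -> None:
    """Information storage: persist one row per (url, field) pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_info (url TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO page_info VALUES (?, ?, ?)",
        [(url, name, value) for name, value in record.items()],
    )
    conn.commit()


def process_page(url: str, html: str, template: PageTemplate,
                 conn: sqlite3.Connection) -> bool:
    """Run extraction, checking, and storage on already-fetched HTML.

    Returns True if the record passed checking and was stored.
    """
    record = extract(html, template)
    if not check(record, template):
        return False
    store(conn, url, record)
    return True
```

In this sketch a caller would build one `PageTemplate` per kind of target page, obtain the HTML with `fetch_html(url)`, and pass it to `process_page`. Fetching is kept separate from the other steps so that extraction and checking can be exercised on saved HTML without network access. Regular expressions are used only to keep the example short; an HTML parser would usually be more robust on real pages.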