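Original source: http://blog.chenlb.com/2008/11/htmlcleaner-use-demo.html
When programming, the data source is sometimes HTML, so the HTML has to be parsed and the data extracted from it. Fortunately the Java community has good libraries for parsing HTML. Having tried and compared them, I personally find htmlcleaner easier to use than htmlparser, and htmlcleaner's XPath support is especially handy. It may also be that I am simply not familiar enough with htmlparser.
htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
First, write an HTML file to test with: html-clean-demo.html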
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>
        <meta http-equiv="Content-Language" content="zh-CN"/>
        <title>html clean demo</title>
    </head>
    <body>
    <div class="d_1">
        <ul>
            <li>bar</li>
            <li>foo</li>
            <li>gzz</li>
        </ul>
    </div>
    <div>
        <ul>
            <li><a name="my_href" href="1.html">text-1</a></li>
            <li><a name="my_href" href="2.html">text-2</a></li>
            <li><a name="my_href" href="3.html">text-3</a></li>
            <li><a name="my_href" href="4.html">text-4</a></li>
        </ul>
    </div>
    </body>
    </html>
Simulated requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The htmlcleaner code for this is in HtmlCleanerDemo.java:
    package com.chenlb;

    import java.io.File;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    /**
     * htmlcleaner usage example.
     *
     * @author chenlb 2008-11-26 14:12:02
     */
    public class HtmlCleanerDemo {

        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();

            // Parse the test file, reading it as GBK to match its declared charset.
            TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

            // Select by tag name (recursive search): the page title.
            Object[] ns = node.getElementsByName("title", true);
            if (ns.length > 0) {
                System.out.println("title=" + ((TagNode) ns[0]).getText());
            }

            // Select by XPath: every li under the div with class="d_1".
            System.out.println("ul/li:");
            ns = node.evaluateXPath("//div[@class='d_1']//li");
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println("\ttext=" + n.getText());
            }

            // Select by attribute value: every element with name="my_href".
            System.out.println("a:");
            ns = node.getElementsByAttValue("name", "my_href", true, true);
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
            }
        }
    }
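The argument to cleaner.clean() can be a file, a URL, or a string holding the HTML content. In my opinion the methods used most often are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes well with malformed HTML.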
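To see what the XPath expression and the attribute lookup in the demo select, the queries can be tried without htmlcleaner at all. The sketch below is not htmlcleaner code: it uses the JDK's built-in javax.xml.xpath on a trimmed copy of the test page (DOCTYPE and xmlns removed so that plain element names match and no DTD is fetched). The class name XPathCheck and its helper methods are made up for this illustration, and it assumes htmlcleaner evaluates these simple expressions the same way as standard XPath 1.0. Unlike htmlcleaner, the JDK parser only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class: checks the demo's queries with the JDK only.
class XPathCheck {

    // Trimmed, well-formed copy of html-clean-demo.html (no DOCTYPE, no namespace).
    static final String HTML =
        "<html><head><title>html clean demo</title></head><body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul>"
        + "<li><a name=\"my_href\" href=\"1.html\">text-1</a></li>"
        + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></li>"
        + "<li><a name=\"my_href\" href=\"3.html\">text-3</a></li>"
        + "<li><a name=\"my_href\" href=\"4.html\">text-4</a></li>"
        + "</ul></div></body></html>";

    // Evaluate an XPath expression against HTML and return the matching nodes.
    static NodeList select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(HTML)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    // Text content of every node matched by the expression.
    static List<String> texts(String expression) {
        NodeList nodes = select(expression);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    // href of every element whose name attribute equals "my_href".
    static List<String> hrefs() {
        NodeList nodes = select("//*[@name='my_href']");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("title=" + texts("//title"));             // title=[html clean demo]
        System.out.println("li=" + texts("//div[@class='d_1']//li")); // li=[bar, foo, gzz]
        System.out.println("href=" + hrefs());                        // href=[1.html, 2.html, 3.html, 4.html]
    }
}
```

The second div has no class attribute, so the predicate [@class='d_1'] keeps its li elements out of the result; only bar, foo, and gzz are returned.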