Web Crawling Techniques: Getting Started with HtmlUnit

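1 Introduction to HtmlUnit

HtmlUnit is an open-source headless (GUI-less) browser written in Java. Because it can execute a page's JavaScript, it is well suited to loading dynamically rendered pages.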

2 Getting HtmlUnit

2.1 Maven

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.31</version>
    </dependency>

2.2 Download from the official site

Download link

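3 Usage

3.1 Fetching a page

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // Fetch a page
    WebClient webClient = new WebClient();
    // Enable JavaScript execution
    webClient.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = null;
    try {
        page = webClient.getPage("https://mp.csdn.net/");
        // Give the page's scripts time to finish rendering
        Thread.sleep(3000);
        // Print the rendered page to the console
        System.out.println(page.asXml());
    } catch (Exception e) {
        e.printStackTrace();
    }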
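The fixed three-second sleep either wastes time on fast pages or is too short on slow ones. One alternative is to poll for a condition that signals the page is ready. The following is only a minimal sketch using the standard library; `PollingWait` and `waitUntil` are hypothetical names introduced for illustration, not part of HtmlUnit. (HtmlUnit's `WebClient` also provides `waitForBackgroundJavaScript(timeoutMillis)`, which may be enough on its own.)

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not an HtmlUnit class): polls a condition
// instead of sleeping for a fixed amount of time.
final class PollingWait {
    private PollingWait() {}

    /**
     * Re-checks the condition every intervalMillis until it holds
     * or timeoutMillis has elapsed.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With the `page` variable from the snippet above, a call might look like `PollingWait.waitUntil(() -> page.asXml().contains("feedlist_id"), 3000, 100)`, which returns as soon as the expected element id appears in the rendered markup. The marker string is an assumption about the target page.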

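3.2 Common settings

    // Enable or disable CSS processing
    webClient.getOptions().setCssEnabled(false);
    // Enable or disable JavaScript execution
    webClient.getOptions().setJavaScriptEnabled(true);
    // Accept insecure SSL connections (works around untrusted HTTPS certificates)
    webClient.getOptions().setUseInsecureSSL(true);
    // Whether a JavaScript error should throw an exception
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Whether to follow redirects
    webClient.getOptions().setRedirectEnabled(true);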

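3.3 Executing JavaScript on the page

    import com.gargoylesoftware.htmlunit.ScriptResult;

    // Run JavaScript in the page and read the result;
    // here, the value of the page variable _hmt
    ScriptResult t = page.executeJavaScript("_hmt");
    System.out.println(t.getJavaScriptResult());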

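3.4 Manipulating the DOM tree and triggering events

    import com.gargoylesoftware.htmlunit.html.DomElement;
    import java.io.IOException;

    // Look up an element, much as you would in JavaScript
    DomElement domElement = page.getElementById("feedlist_id");
    try {
        // Trigger a click event and get the resulting page
        HtmlPage page1 = domElement.click();
    } catch (IOException e) {
        e.printStackTrace();
    }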

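3.5 Converting between HtmlUnit and HttpClient

A crawler may need to use these two tools together. The core of the conversion is copying cookies from one to the other.

    import com.gargoylesoftware.htmlunit.util.Cookie;
    import java.util.List;
    import java.util.Set;
    import org.apache.http.client.CookieStore;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.cookie.BasicClientCookie;

    // Create the HttpClient instance with its own cookie store
    CookieStore cookieStore = new BasicCookieStore();
    CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
    // Read the HtmlUnit cookies
    Set<Cookie> htmlUnitCookies = webClient.getCookieManager().getCookies();
    // Convert HtmlUnit cookies to HttpClient cookies
    for (Cookie cookie : htmlUnitCookies) {
        BasicClientCookie converted = new BasicClientCookie(cookie.getName(), cookie.getValue());
        // HttpClient only sends cookies whose domain and path match the request
        converted.setDomain(cookie.getDomain());
        converted.setPath(cookie.getPath());
        cookieStore.addCookie(converted);
    }
    // Read the HttpClient cookies
    List<org.apache.http.cookie.Cookie> httpClientCookies = cookieStore.getCookies();
    // Convert HttpClient cookies back to HtmlUnit cookies
    for (org.apache.http.cookie.Cookie cookie : httpClientCookies) {
        webClient.getCookieManager().addCookie(new Cookie(cookie.getDomain(), cookie.getName(), cookie.getValue()));
    }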
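If only a handful of requests go through another HTTP client, the same cookie data can also be sent as a raw `Cookie` request header rather than through a cookie store. The following is a rough sketch under that assumption: it uses only the standard library, `CookieHeader` and `build` are hypothetical names, and the map stands in for name/value pairs already copied out of the HtmlUnit cookies. It ignores domain and path scoping, so it is only suitable when every cookie belongs to the site being requested.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of HtmlUnit or HttpClient):
// serializes cookie name/value pairs into a Cookie request header value.
final class CookieHeader {
    private CookieHeader() {}

    /** Joins the entries as "name=value" pairs separated by "; ", in map iteration order. */
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }
}
```

The resulting string can be set as the value of the `Cookie` header on an outgoing request. Use an ordered map such as `LinkedHashMap` if the cookie order should be stable.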