
jsoup: Fetching Content from a URL by Sending a Request

Popularity: 701   Published: 2012-10-08 19:54:56

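jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. See http://jsoup.org/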

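The main features of jsoup are:

  • Parse HTML from a URL, a file, or a string;
  • Find and extract data using DOM traversal or CSS selectors;
  • Manipulate HTML elements, attributes, and text;
  • jsoup is released under the MIT license, so it is safe to use in commercial projects.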

Download and installation:

With Maven, add the following to pom.xml:

    <dependency>
      <!-- jsoup HTML parser library @ http://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.5.2</version>
    </dependency>

Parsing HTML with jsoup works as follows.

Parsing HTML from a URL:

    Document doc = Jsoup.connect("http://example.com")
      .data("query", "Java")
      .userAgent("Mozilla")
      .cookie("auth", "token")
      .timeout(3000)
      .post();

Parsing from a file:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

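Much like the DOM methods in JavaScript, jsoup provides the following:

  • getElementById(String id): get an element by its id
  • getElementsByTag(String tag): get elements by tag name
  • getElementsByClass(String className): get elements by class name
  • getElementsByAttribute(String key): get elements that have the given attribute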

It also provides these methods for getting sibling elements:

siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()

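Use the following methods to read or set an element's data:

  • attr(String key): get an attribute value
  • attr(String key, String value): set an attribute value
  • attributes(): get all attributes
  • id(), className(), classNames(): get the id and class values
  • text(): get the text content
  • text(String value): set the text content
  • html(): get the inner HTML
  • html(String value): set the inner HTML
  • outerHtml(): get the outer HTML, including the element's own tag
  • data(): get the data content (for example, of a script or style element)
  • tag(): get the Tag object; tagName(): get the tag name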

jsoup provides the following methods for manipulating HTML:

  • append(String html), prepend(String html)
  • appendText(String text), prependText(String text)
  • appendElement(String tagName), prependElement(String tagName)
  • html(String value)

Working with HTML through jQuery-like selectors:
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Elements links = doc.select("a[href]"); // a with href
    Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

    Element masthead = doc.select("div.masthead").first(); // div with class=masthead

    Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

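The supported selectors include:

  • tagname: find elements by tag name
  • ns|tag: find elements by tag name within a namespace
  • #id: find an element by id
  • .class: find elements by class name
  • [attribute]: elements that have the attribute
  • [^attr]: elements that have an attribute whose name starts with attr
  • [attr=value]: elements whose attribute value equals value
  • [attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains value
  • [attr~=regex]: elements whose attribute value matches the regular expression
  • *: all elements

Selector combinations:

  • el#id: an element el with the given id
  • el.class: elements el with the given class
  • el[attr]: elements el with the given attribute
  • ancestor child: child elements that descend from ancestor

and so on.
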
An example that fetches a page's title, its description, and the images inside it:

    // Assumed context, not shown here: log is the class logger, StringUtils is
    // Apache Commons Lang, IMAGE_TYPE_ARRAY is a String[] of image suffixes,
    // and imgSrcs is a collection that receives the matching image URLs.
    public void parse(String urlStr) {
        Document doc;
        try {
            doc = Jsoup.connect(urlStr)
                    .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // set the User-Agent
                    .timeout(5000) // set the timeout in milliseconds
                    .get();
        } catch (IOException e) {
            // MalformedURLException, SocketTimeoutException and UnknownHostException
            // are all subclasses of IOException; each one is logged and parsing stops.
            log.error(e);
            return;
        }
        System.out.println(doc.title());
        Element head = doc.head();
        Elements metas = head.select("meta");
        for (Element meta : metas) {
            String content = meta.attr("content");
            // Stop if the page declares a content type other than text/html
            if ("content-type".equalsIgnoreCase(meta.attr("http-equiv"))
                    && !StringUtils.startsWith(content, "text/html")) {
                log.debug(urlStr);
                return;
            }
            if ("description".equalsIgnoreCase(meta.attr("name"))) {
                System.out.println(content);
            }
        }
        Element body = doc.body();
        for (Element img : body.getElementsByTag("img")) {
            String imageUrl = img.attr("abs:src"); // get the absolute URL
            // Drop the query string before comparing suffixes
            if (imageUrl.indexOf("?") > 0) {
                imageUrl = imageUrl.substring(0, imageUrl.indexOf("?"));
            }
            for (String suffix : IMAGE_TYPE_ARRAY) {
                if (StringUtils.endsWithIgnoreCase(imageUrl, suffix)) {
                    imgSrcs.add(imageUrl);
                    break;
                }
            }
        }
    }
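
The image filtering step at the end of parse can be tried on its own, without a network connection. The following is a minimal sketch rather than the author's code: the class name ImageUrlFilter, its method names, and the suffix list are all assumptions that stand in for IMAGE_TYPE_ARRAY and imgSrcs, which the post does not show. It strips the query string, then compares what is left against each suffix, ignoring case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ImageUrlFilter {
    // Hypothetical suffix list; the original IMAGE_TYPE_ARRAY is not shown.
    static final String[] IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"};

    /** Returns the URL without its query string, if it has one. */
    static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q >= 0 ? url.substring(0, q) : url;
    }

    /** True when the URL (minus query string) ends with a known suffix, ignoring case. */
    static boolean isImageUrl(String url) {
        String path = stripQuery(url).toLowerCase(Locale.ROOT);
        for (String suffix : IMAGE_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the image URLs, stored without their query strings. */
    static List<String> keepImages(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isImageUrl(url)) {
                kept.add(stripQuery(url));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/a.PNG?v=3",
                "http://example.com/page.html",
                "http://example.com/b.jpg");
        // prints [http://example.com/a.PNG, http://example.com/b.jpg]
        System.out.println(keepImages(urls));
    }
}
```

Stripping the query string once, before the suffix loop, gives the same result as doing it inside the loop and avoids repeating the work for every suffix.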
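
The key point here is how to get the absolute URL of an image or a link.

As the example above shows, String imageUrl = img.attr("abs:src"); does exactly that: put abs: in front of the attribute name and jsoup returns the absolute URL.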

To see why, let's look at the source code:

    // This method first looks the key up in the attribute map and returns the
    // value if it is there; otherwise it checks whether the key starts with abs:
    public String attr(String attributeKey) {
        Validate.notNull(attributeKey);

        if (hasAttr(attributeKey))
            return attributes.get(attributeKey);
        else if (attributeKey.toLowerCase().startsWith("abs:"))
            return absUrl(attributeKey.substring("abs:".length()));
        else return "";
    }
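
The order of the checks matters: a literal attribute match wins, and the abs: prefix is only consulted when no such attribute exists. The sketch below models that lookup order with a plain map. It is not jsoup code: AttrLookupSketch and its put method are hypothetical names, and java.net.URI.resolve is used here only as a simple stand-in for the real URL resolution.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class AttrLookupSketch {
    private final Map<String, String> attributes = new HashMap<>();
    private final String baseUri;

    AttrLookupSketch(String baseUri) {
        this.baseUri = baseUri;
    }

    /** Hypothetical helper for filling in attributes; returns this for chaining. */
    AttrLookupSketch put(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    /** Same order of checks as the attr method above: literal key, then abs: prefix, then "". */
    String attr(String key) {
        if (attributes.containsKey(key)) {
            return attributes.get(key);
        } else if (key.toLowerCase(Locale.ROOT).startsWith("abs:")) {
            return absUrl(key.substring("abs:".length()));
        }
        return "";
    }

    /** Stand-in resolution: resolves the attribute value against the base URI. */
    String absUrl(String key) {
        if (!attributes.containsKey(key)) {
            return "";
        }
        try {
            return URI.create(baseUri).resolve(attributes.get(key)).toString();
        } catch (IllegalArgumentException e) {
            return ""; // base or attribute value is not a valid URI
        }
    }

    public static void main(String[] args) {
        AttrLookupSketch img = new AttrLookupSketch("http://example.com/news/index.html")
                .put("src", "img/logo.png");
        System.out.println(img.attr("src"));     // prints img/logo.png
        System.out.println(img.attr("abs:src")); // prints http://example.com/news/img/logo.png
        System.out.println(img.attr("abs:href").isEmpty()); // prints true
    }
}
```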


Next, look at the absUrl method:


  /**
     * Get an absolute URL from a URL attribute that may be relative (i.e. an <code>&lt;a href></code> or
     * <code>&lt;img src></code>).
     * <p/>
     * E.g.: <code>String absUrl = linkEl.absUrl("href");</code>
     * <p/>
     * If the attribute value is already absolute (i.e. it starts with a protocol, like
     * <code>http://</code> or <code>https://</code> etc), and it successfully parses as a URL, the attribute is
     * returned directly. Otherwise, it is treated as a URL relative to the element's {@link #baseUri}, and made
     * absolute using that.
     * <p/>
     * As an alternate, you can use the {@link #attr} method with the <code>abs:</code> prefix, e.g.:
     * <code>String absUrl = linkEl.attr("abs:href");</code>
     *
     * @param attributeKey The attribute key
     * @return An absolute URL if one could be made, or an empty string (not null) if the attribute was missing or
     * could not be made successfully into a URL.
     * @see #attr
     * @see java.net.URL#URL(java.net.URL, String)
     */
    // By this point it should be clear how the absolute URL is obtained
    public String absUrl(String attributeKey) {
        Validate.notEmpty(attributeKey);

        String relUrl = attr(attributeKey);
        if (!hasAttr(attributeKey)) {
            return ""; // nothing to make absolute with
        } else {
            URL base;
            try {
                try {
                    base = new URL(baseUri);
                } catch (MalformedURLException e) {
                    // the base is unsuitable, but the attribute may be abs on its own, so try that
                    URL abs = new URL(relUrl);
                    return abs.toExternalForm();
                }
                // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
                if (relUrl.startsWith("?"))
                    relUrl = base.getPath() + relUrl;
                URL abs = new URL(base, relUrl);
                return abs.toExternalForm();
            } catch (MalformedURLException e) {
                return "";
            }
        }
    }
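
The resolution logic in absUrl depends only on java.net.URL, so it can be tried without jsoup. The following is a rough sketch under that assumption: AbsUrlSketch and resolve are hypothetical names, and the base URI and attribute value are passed in as plain strings instead of being read from an element.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class AbsUrlSketch {
    /**
     * Resolves relUrl against baseUri the same way the absUrl source above does.
     * Returns "" when no absolute URL can be built.
     */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs; kept to mirror the source above
    static String resolve(String baseUri, String relUrl) {
        try {
            URL base;
            try {
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // Unusable base: the value may still be absolute on its own
                return new URL(relUrl).toExternalForm();
            }
            // Same workaround as above: keep the file part when only a query is given
            if (relUrl.startsWith("?")) {
                relUrl = base.getPath() + relUrl;
            }
            return new URL(base, relUrl).toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html";
        System.out.println(resolve(base, "img/logo.png"));  // prints http://example.com/news/img/logo.png
        System.out.println(resolve(base, "/img/logo.png")); // prints http://example.com/img/logo.png
        System.out.println(resolve(base, "?page=2"));       // prints http://example.com/news/index.html?page=2
        System.out.println(resolve("", "a.png").isEmpty()); // prints true
    }
}
```

This also shows why the base URI passed to Jsoup.parse(input, "UTF-8", "http://example.com/") matters when parsing from a file: without a usable base, a relative src or href resolves to an empty string.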