The Java API can fetch the content of most web pages at a given address without trouble. The simplest way to do it is:

// Open the page and read it line by line into a buffer
URL url = new URL(myurl);
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String s;
StringBuffer sb = new StringBuffer();
while ((s = br.readLine()) != null) {
    sb.append(s).append("\r\n");
}
br.close();

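The read loop above can also be wrapped in a small helper so that the stream is always closed and the character set is stated explicitly. The following is only a sketch: the class name PageReader, the method names readAll and fetch, and the choice of UTF-8 are assumptions made for illustration, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
public class PageReader {

    // Reads every line from the reader and joins the lines with "\r\n",
    // mirroring the loop in the snippet above. The reader is closed on exit.
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Opens the URL and decodes the body as UTF-8.
    // UTF-8 is an assumption; the page may declare a different charset.
    public static String fetch(String address) throws IOException {
        return readAll(new InputStreamReader(
                new URL(address).openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; replace it with the page you want to fetch.
        System.out.println(fetch("http://localhost:8080/mydomain/index.jsp"));
    }
}
```

Keeping readAll independent of the network makes it easy to try with any Reader, for example a StringReader.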
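This approach works for ordinary pages. However, when a page involves a chain of nested redirects, it fails with an error such as "Server redirected too many times". The page contains code that redirects to other pages, and once the redirects loop too many times the program aborts. If you only want the content returned by this URL itself and do not want any redirects to be followed, you can use the following code.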

URL urlmy = new URL(myurl);
HttpURLConnection con = (HttpURLConnection) urlmy.openConnection();
// Turn off redirect following for this connection only.
// The static HttpURLConnection.setFollowRedirects(...) changes the default
// for all connections and is not needed here.
con.setInstanceFollowRedirects(false);
con.connect();
BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
String s;
StringBuffer sb = new StringBuffer();
while ((s = br.readLine()) != null) {
    sb.append(s).append("\r\n");
}
br.close();

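With redirect following turned off, a redirecting URL answers with a 3xx status and a Location header instead of the target page, so it can be useful to look at both before reading the body. The sketch below shows one way to do that. The class name RedirectCheck and the methods isRedirect and probe are hypothetical names used for illustration, and the list of status codes is the usual set of HTTP redirects rather than anything taken from the original code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, named for illustration only.
public class RedirectCheck {

    // True for the HTTP status codes that normally carry a Location header.
    public static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Requests the URL once, without following redirects,
    // and returns a short description of what the server answered.
    public static String probe(String address) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) new URL(address).openConnection();
        con.setInstanceFollowRedirects(false);
        try {
            int status = con.getResponseCode();
            if (isRedirect(status)) {
                // The redirect target is reported, not visited.
                return status + " -> " + con.getHeaderField("Location");
            }
            return String.valueOf(status);
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; replace it with the page you want to check.
        System.out.println(probe("http://localhost:8080/mydomain/index.jsp"));
    }
}
```

If the status is a redirect, the body read from getInputStream() is the body of the redirect response itself, which is often empty or a short notice.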
Complete example code (netpc.java):

package cn.com.bps.test;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class netpc {
    public static void main(String[] args) {

        // Address of the page to fetch
        String myurl = "http://localhost:8080/mydomain/index.jsp";

        URL urlmy = null;
        HttpURLConnection con = null;
        try {
            urlmy = new URL(myurl);
            con = (HttpURLConnection) urlmy.openConnection();
            // con.setFollowRedirects(true);
            con.setInstanceFollowRedirects(false);
            con.connect();

            BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
            String line;
            while ((line = br.readLine()) != null) {
                // Process each line read from the page; here it is simply printed
                System.out.println(line);
            }
            br.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
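The example decodes the response as "UTF-8" unconditionally. If a page is served in another encoding, the charset named in the Content-Type response header can be used instead. The following is a minimal sketch under that assumption: the class CharsetGuess and the method charsetFrom are hypothetical names, and falling back to UTF-8 is a choice made here for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, named for illustration only.
public class CharsetGuess {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 when the
    // header is missing, has no charset parameter, or names an unknown charset.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom(null));
    }
}
```

In the example above, the reader could then be created with new InputStreamReader(con.getInputStream(), CharsetGuess.charsetFrom(con.getContentType())).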