
How to make nutch run in eclipse

Nutch is an excellent open-source crawling framework: with a little configuration you can start crawling data, and its flexible plugin mechanism makes it easy to extend for your own needs. In this post I show how to debug Nutch in local mode inside Eclipse. Most people drive Nutch from the command line, which is simple and convenient, but it makes deeper customization hard to approach; once you can step through the code in Eclipse, everything else becomes much easier to learn. So this post covers building Nutch and the basics of debugging it.


Now to the main topic. The basic steps are:

1. Install Ant (used to build the Nutch source).
2. Download the Nutch source code (required).
3. Run ant in the Nutch source root and wait for the build to finish (builds Nutch).
4. Configure nutch-site.xml (required).
5. Run ant eclipse to generate an Eclipse project, then import it into Eclipse for debugging.
6. Move the conf directory to the top of the build path; Nutch reads its configuration files from it at load time.
7. Run org.apache.nutch.crawl.Injector to inject the seed URLs (local debugging).
8. Run org.apache.nutch.crawl.Generator to generate a fetch list (local debugging).
9. Run org.apache.nutch.fetcher.Fetcher to fetch the URLs in the generated segment (local debugging).
10. Run org.apache.nutch.parse.ParseSegment to parse the fetched content (local debugging).
11. Set up a Solr server to serve search queries.
12. Run org.apache.nutch.indexer.IndexingJob to index the crawl into Solr (local debugging).
13. When indexing completes, run queries in Solr to verify the results.


After the build, import the project into Eclipse. Make sure the conf folder sits at the top of the build path, so that its configuration files are found first on the classpath.


The configuration in nutch-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
  <name>http.agent.name</name>
  <value>mynutch</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

</configuration>
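The conf folder has to be first on the classpath precisely because this file is loaded from there. As a quick check that your overrides are in effect, you can print them back; a small sketch, assuming Nutch 1.8's org.apache.nutch.util.NutchConfiguration:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfCheck {
  public static void main(String[] args) {
    // NutchConfiguration.create() loads nutch-default.xml and then
    // nutch-site.xml from the classpath; with conf on top, the values
    // above win. This should print "mynutch" and "./src/plugin".
    Configuration conf = NutchConfiguration.create();
    System.out.println("http.agent.name = " + conf.get("http.agent.name"));
    System.out.println("plugin.folders  = " + conf.get("plugin.folders"));
  }
}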

Next, a quick note on the changes needed to run each class. We are debugging Nutch in Hadoop local mode, so you first have to get around Hadoop's local file-permission checks, or the jobs will fail at runtime (the usual symptom on Windows). A simple workaround is to copy Hadoop's FileUtil class into the Eclipse project and relax its permission check, as sketched below. If you run on Linux, you can skip this step.
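A minimal sketch of that workaround, assuming Hadoop 1.x, where the check that fails is org.apache.hadoop.fs.FileUtil.checkReturnValue: copy the full FileUtil source into your project under the same package, so it shadows the class in hadoop-core.jar, then disable the body of the check:

private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException {
  // The stock implementation threw an IOException whenever a
  // chmod-style call returned false, which it always does on Windows.
  // For local-mode debugging we simply skip the verification.
  // if (!rv) {
  //   throw new IOException("Failed to set permissions of path: " + p
  //       + " to " + String.format("%04o", permission.toShort()));
  // }
}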

Before you start debugging, create a urls folder in the project root and put a seed file in it listing the URLs you want to crawl.
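For example, a hypothetical urls/seed.txt with one start URL per line:

http://www.example.com/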

In the Injector class, change the run method to:

  public int run(String[] args) throws Exception {
//    if (args.length < 2) {
//      System.err.println("Usage: Injector <crawldb> <url_dir>");
//      return -1;
//    }
    // Hardcode the CrawlDb directory and the seed-URL folder for
    // local debugging.
    args = new String[]{"mydir", "urls"};
    try {
      inject(new Path(args[0]), new Path(args[1]));
      return 0;
    } catch (Exception e) {
      LOG.error("Injector: " + StringUtils.stringifyException(e));
      return -1;
    }
  }
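Once the Injector has run, you can verify that the seeds actually landed in the CrawlDb. A sketch, assuming Nutch 1.8's org.apache.nutch.crawl.CrawlDbReader (the class behind bin/nutch readdb):

public class CrawlDbCheck {
  public static void main(String[] args) throws Exception {
    // Equivalent to "bin/nutch readdb mydir -stats": prints how many
    // URLs the CrawlDb holds, broken down by fetch status.
    org.apache.nutch.crawl.CrawlDbReader.main(new String[]{"mydir", "-stats"});
  }
}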

Change the run method in Generator to:
  public int run(String[] args) throws Exception {
//    if (args.length < 2) {
//      System.out
//          .println("Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]");
//      return -1;
//    }
    // Hardcode the CrawlDb and segments directory. The trailing "6",
    // "7" and "" are not recognized flags, so the option loop below
    // simply ignores them.
    args = new String[]{"mydir", "myseg", "6", "7", ""};
    Path dbDir = new Path(args[0]);
    Path segmentsDir = new Path(args[1]);
    long curTime = System.currentTimeMillis();
    long topN = Long.MAX_VALUE;
    int numFetchers = -1;
    boolean filter = true;
    boolean norm = true;
    boolean force = false;
    int maxNumSegments = 1;
    for (int i = 2; i < args.length; i++) {
      if ("-topN".equals(args[i])) {
        topN = Long.parseLong(args[i + 1]);
        i++;
      } else if ("-numFetchers".equals(args[i])) {
        numFetchers = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-adddays".equals(args[i])) {
        long numDays = Integer.parseInt(args[i + 1]);
        curTime += numDays * 1000L * 60 * 60 * 24;
      } else if ("-noFilter".equals(args[i])) {
        filter = false;
      } else if ("-noNorm".equals(args[i])) {
        norm = false;
      } else if ("-force".equals(args[i])) {
        force = true;
      } else if ("-maxNumSegments".equals(args[i])) {
        maxNumSegments = Integer.parseInt(args[i + 1]);
      }
    }
    try {
      Path[] segs = generate(dbDir, segmentsDir, numFetchers, topN, curTime, filter,
          norm, force, maxNumSegments);
      if (segs == null) return -1;
    } catch (Exception e) {
      LOG.error("Generator: " + StringUtils.stringifyException(e));
      return -1;
    }
    return 0;
  }
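One thing to note before the next step: the Generator creates a new timestamped directory under myseg (like the myseg\20140520120541 path hardcoded below), and the Fetcher, ParseSegment and IndexingJob all need that path. A small hypothetical helper for locating it:

import java.io.File;
import java.util.Arrays;

public class Segments {
  // Returns the newest segment directory under segmentsDir. Segment
  // names are timestamps such as 20140520120541, so the largest name
  // in sort order is the most recent one.
  public static String newest(String segmentsDir) {
    String[] names = new File(segmentsDir).list();
    Arrays.sort(names);
    return segmentsDir + File.separator + names[names.length - 1];
  }
}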


In Fetcher's run method, make this change:
  public int run(String[] args) throws Exception {
    String usage = "Usage: Fetcher <segment> [-threads n]";
    // Hardcode the segment the Generator produced (the timestamped
    // directory under myseg). Note the "-threads" flag must precede
    // the count; a bare "4" would be ignored by the option loop below
    // and the default of 10 threads used instead.
    args = new String[]{
        "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541",
        "-threads", "4"};
//    if (args.length < 1) {
//      System.err.println(usage);
//      return -1;
//    }
    Path segment = new Path(args[0]);
    int threads = getConf().getInt("fetcher.threads.fetch", 10);
    boolean parsing = false;
    for (int i = 1; i < args.length; i++) {       // parse command line
      if (args[i].equals("-threads")) {           // found -threads option
        threads = Integer.parseInt(args[++i]);
      }
    }
    getConf().setInt("fetcher.threads.fetch", threads);
    try {
      fetch(segment, threads);
      return 0;
    } catch (Exception e) {
      LOG.error("Fetcher: " + StringUtils.stringifyException(e));
      return -1;
    }
  }
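The hardcoded thread count ends up in the fetcher.threads.fetch property that the code above reads (default 10), so instead of passing -threads you could also set it once in nutch-site.xml; a possible override:

<property>
  <name>fetcher.threads.fetch</name>
  <value>4</value>
</property>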

In ParseSegment's run method, make this change:

  public int run(String[] args) throws Exception {
    Path segment;
    String usage = "Usage: ParseSegment segment [-noFilter] [-noNormalize]";
//    if (args.length == 0) {
//      System.err.println(usage);
//      System.exit(-1);
//    }
    // Hardcode the same segment directory the Fetcher just filled.
    args = new String[]{"D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541"};
    if (args.length > 1) {
      for (int i = 1; i < args.length; i++) {
        String param = args[i];
        if ("-nofilter".equalsIgnoreCase(param)) {
          getConf().setBoolean("parse.filter.urls", false);
        } else if ("-nonormalize".equalsIgnoreCase(param)) {
          getConf().setBoolean("parse.normalize.urls", false);
        }
      }
    }
    segment = new Path(args[0]);
    parse(segment);
    return 0;
  }
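To eyeball what has been fetched and parsed, you can dump the segment to readable text. A sketch assuming Nutch 1.8's org.apache.nutch.segment.SegmentReader (the class behind bin/nutch readseg):

public class SegmentDump {
  public static void main(String[] args) throws Exception {
    // Equivalent to "bin/nutch readseg -dump <segment> <output>": writes
    // a plain-text dump of the crawl, content and parse data to ./segdump.
    org.apache.nutch.segment.SegmentReader.main(new String[]{
        "-dump",
        "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541",
        "segdump"});
  }
}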

In IndexingJob's run method, make this change:
  public int run(String[] args) throws Exception {
    // Hardcode the CrawlDb and the parsed segment for local debugging.
    args = new String[]{"mydir", "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541"};
    if (args.length < 2) {
      System.err
          .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
      IndexWriters writers = new IndexWriters(getConf());
      System.err.println(writers.describe());
      return -1;
    }
    final Path crawlDb = new Path(args[0]);
    Path linkDb = null;
    final List<Path> segments = new ArrayList<Path>();
    String params = null;
    boolean noCommit = false;
    boolean deleteGone = false;
    boolean filter = false;
    boolean normalize = false;
    for (int i = 1; i < args.length; i++) {
      if (args[i].equals("-linkdb")) {
        linkDb = new Path(args[++i]);
      } else if (args[i].equals("-dir")) {
        Path dir = new Path(args[++i]);
        FileSystem fs = dir.getFileSystem(getConf());
        FileStatus[] fstats = fs.listStatus(dir,
            HadoopFSUtil.getPassDirectoriesFilter(fs));
        Path[] files = HadoopFSUtil.getPaths(fstats);
        for (Path p : files) {
          segments.add(p);
        }
      } else if (args[i].equals("-noCommit")) {
        noCommit = true;
      } else if (args[i].equals("-deleteGone")) {
        deleteGone = true;
      } else if (args[i].equals("-filter")) {
        filter = true;
      } else if (args[i].equals("-normalize")) {
        normalize = true;
      } else if (args[i].equals("-params")) {
        params = args[++i];
      } else {
        segments.add(new Path(args[i]));
      }
    }
    try {
      index(crawlDb, linkDb, segments, noCommit, deleteGone, params,
          filter, normalize);
      return 0;
    } catch (final Exception e) {
      LOG.error("Indexer: " + StringUtils.stringifyException(e));
      return -1;
    }
  }


Beyond that, you also need to change the Solr address in two places: add the following at line 187 of SolrIndexWriter and at line 54 of SolrUtils respectively:
        // SolrIndexWriter, around line 187: override the configured URL.
        String serverURL = conf.get(SolrConstants.SERVER_URL);
        serverURL = "http://localhost:8983/solr/";

        // SolrUtils, around line 54: replace the configured URL.
//      String serverURL = job.get(SolrConstants.SERVER_URL);
        String serverURL = "http://localhost:8983/solr";
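Both of those lines read the URL from SolrConstants.SERVER_URL, which (assuming Nutch 1.8) resolves to the configuration key solr.server.url, so an arguably cleaner alternative to patching the code is adding the property to nutch-site.xml:

<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
</property>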




Follow the steps above, adjusting each class's hardcoded arguments before you run it: Nutch jobs are chained, and each job's input is usually the previous job's output. Run the five classes above by hand, in order, and the index ends up in Solr. If you would rather not edit every run() method, the same pipeline can also be driven from a single main(), as sketched below.
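For reference, a hedged sketch of the whole pipeline in one driver: it assumes the unmodified Nutch 1.8 classes (all five implement Hadoop's Tool interface), the mydir/myseg/urls layout used above, and the hypothetical Segments.newest helper sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.IndexingJob;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class LocalCrawlDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // 1. Inject the seed URLs from ./urls into the CrawlDb in ./mydir.
    ToolRunner.run(conf, new Injector(), new String[]{"mydir", "urls"});

    // 2. Generate a fetch list; this creates a new timestamped
    //    segment directory under ./myseg.
    ToolRunner.run(conf, new Generator(),
        new String[]{"mydir", "myseg", "-topN", "10"});

    // Locate the segment the Generator just created.
    String segment = Segments.newest("myseg");

    // 3. Fetch the pages in the segment with 4 threads.
    ToolRunner.run(conf, new Fetcher(),
        new String[]{segment, "-threads", "4"});

    // 4. Parse the fetched content.
    ToolRunner.run(conf, new ParseSegment(), new String[]{segment});

    // 5. Push the parsed segment into the Solr index.
    ToolRunner.run(conf, new IndexingJob(), new String[]{"mydir", segment});
  }
}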




Of course, you can also configure the tokenization and analysis strategy on the Solr side to make retrieval more flexible and accurate.