Problems encountered when running the Nutch 1.2 crawler in Eclipse

Popularity: 90   Published: 2016-04-23 02:18:12.0
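I have been studying Nutch recently and imported the crawler's source code into Eclipse. I configured the project by following this page on the Apache wiki:
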
http://wiki.apache.org/nutch/RunNutchInEclipse1.0

However, running the unit tests produces the following warnings, errors, and exception:

2011-05-27 11:15:46,747 WARN  regex.RegexURLNormalizer (RegexURLNormalizer.java:setConf(113)) - Can't load the default config file! regex-normalize.xml
2011-05-27 11:15:46,760 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - prefix-urlfilter.txt not found
2011-05-27 11:15:46,773 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - suffix-urlfilter.txt not found
2011-05-27 11:15:46,775 WARN  suffix.SuffixURLFilter (SuffixURLFilter.java:readConfigurationFile(175)) - Missing urlfilter.suffix.file, all URLs will be rejected!
2011-05-27 11:15:46,785 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - regex-urlfilter.txt not found
2011-05-27 11:15:46,786 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: regex-urlfilter.txt
2011-05-27 11:15:46,794 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - automaton-urlfilter.txt not found
2011-05-27 11:15:46,795 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: automaton-urlfilter.txt
2011-05-27 11:15:46,800 WARN  domain.DomainURLFilter (DomainURLFilter.java:setConf(135)) - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domain
2011-05-27 11:15:46,801 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(968)) - found resource domain-urlfilter.txt at file:/boot/wx-zone/nutch_all/bin/domain-urlfilter.txt
2011-05-27 11:15:46,868 WARN  domain.DomainSuffixes (DomainSuffixes.java:<init>(47)) - java.net.MalformedURLException
    at java.net.URL.<init>(URL.java:601)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at org.apache.nutch.util.domain.DomainSuffixesReader.read(DomainSuffixesReader.java:54)
    at org.apache.nutch.util.domain.DomainSuffixes.<init>(DomainSuffixes.java:44)

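The log shows that several configuration files (the .txt filter files and regex-normalize.xml) were not loaded, even though the same crawler runs fine from the command line.
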
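My eventual fix was to copy all the configuration files from the crawler's root directory into the src/test package, which resolved the problem. It seems the Nutch unit tests depend heavily on these files being present under the test directory, which is rather messy.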
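
To see which files are actually missing before copying anything, a small check like the one below may help. It is only a sketch, not part of Nutch: the class name `ConfigClasspathCheck` and the method `findMissing` are made up for illustration, the file names are taken from the log above, and it assumes the configuration files are looked up by name through the thread's context class loader, which is what the "found resource ... at file:/.../bin/..." line suggests.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Nutch): reports which config files
// are not visible on the classpath of the current run configuration.
public class ConfigClasspathCheck {

    // File names copied from the log output above.
    static final List<String> CONFIG_FILES = Arrays.asList(
            "regex-normalize.xml",
            "prefix-urlfilter.txt",
            "suffix-urlfilter.txt",
            "regex-urlfilter.txt",
            "automaton-urlfilter.txt",
            "domain-urlfilter.txt");

    // Returns the names that the given class loader cannot resolve.
    static List<String> findMissing(ClassLoader loader, List<String> names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);
            if (url == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: lookups go through the context class loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : CONFIG_FILES) {
            URL url = loader.getResource(name);
            System.out.println(name + " -> " + (url == null ? "NOT FOUND" : url));
        }
        System.out.println("Missing: " + findMissing(loader, CONFIG_FILES));
    }
}
```

Running this as a Java application inside the same Eclipse project should list every file that the project's classpath does not expose. If the assumption holds, any directory that ends up on the classpath would work as a location for the files, so copying them under src/test is one option rather than the only one.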