Problem indexing and searching PDF files with Lucene

Views: 5163   Posted: 2013-02-25 21:22:08.0
Problem indexing and searching PDF files with Lucene
The indexing program is as follows:
Java code
public class testIndexer {
    private IndexWriter writer;

    public testIndexer(String indexDir) throws CorruptIndexException, LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new IKAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws CorruptIndexException, IOException {
        writer.close();
    }

    public void indexPDFile(String filename) throws Exception {
        File file = new File(filename);
        String content = PdfExtractor.getText(file);
        Document doc = new Document();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    public static void main(String args[]) {
        String path = "k:/aaaa";
        String pdfile = "k:/kaks.pdf";
        try {
            testIndexer indx = new testIndexer(path);
            indx.indexPDFile(pdfile);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

The search program is as follows:
Java code
public class testSearch {
    public static void main(String args[]) throws IOException, ParseException {
        String indexDir = "K:/aaaa";
        String q = "分子";
        search(indexDir, q);
    }

    public static void search(String indexDir, String q) throws IOException, ParseException {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_35, "content", new IKAnalyzer());
        Query query = parser.parse(q);

        TopDocs hits = is.search(query, 10);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("content"));
        }
        is.close();
    }
}

After the index is built, the aaaa folder contains only three files: _0.fdt, _0.fdx and write.lock.

Running the search produces the following error:
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@K:\aaaa lockFactory=org.apache.lucene.store.NativeFSLockFactory@ec16a4: files: [write.lock, _0.fdt, _0.fdx]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:322)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:110)
at com.index_search.testSearch.search(testSearch.java:24)
at com.index_search.testSearch.main(testSearch.java:19)
How should I build an index for a PDF and then search for keywords in it?

------Solution--------------------------------------------------------
You can parse the PDF with Tika.
------Solution--------------------------------------------------------
You need to use the Tika parser to extract the text.
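------Solution--------------------------------------------------------
What you probably need is a lookup table that associates keywords with document locations, and then to index by keyword. If you index the file directly, you have to read the PDF's content out first. But what if the PDF consists entirely of images? Then there is nothing to extract. So in general an index is built from a dedicated mapping table rather than by reading the files directly.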
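The keyword-to-location table described in this answer could be sketched roughly as follows. This is only an illustration in plain Java, not something from the thread: the class name KeywordTable and its methods add and lookup are hypothetical, and how the keywords for each PDF are obtained is left open.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: a keyword -> document-path table kept outside the PDFs.
class KeywordTable {
    private final Map<String, Set<String>> pathsByKeyword = new HashMap<>();

    // Register that the document at `path` is relevant to `keyword`.
    void add(String keyword, String path) {
        pathsByKeyword.computeIfAbsent(keyword, k -> new TreeSet<>()).add(path);
    }

    // Return the paths registered for `keyword` (an empty set if there are none).
    Set<String> lookup(String keyword) {
        Set<String> paths = pathsByKeyword.get(keyword);
        if (paths == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(paths);
    }
}
```

With such a table, a call like add("分子", "k:/kaks.pdf") followed by lookup("分子") yields a set containing that path, whether or not the PDF's text can be extracted.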
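------Solution--------------------------------------------------------
The index only becomes usable once the writer has been closed; the indexing program above never calls indx.close().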
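This matches the exception in the question: Lucene writes the segments file when the writer commits, and IndexWriter.close() commits pending changes. In the indexer's main method that would mean calling indx.close() after indx.indexPDFile(pdfile), ideally in a finally block. To confirm that the commit actually happened, a check along the following lines may help. It is a minimal sketch, not a Lucene API: the class IndexDirCheck and its method are hypothetical, and it only looks at file names.

```java
import java.io.File;

// Hypothetical helper, not part of Lucene: a filename heuristic only.
class IndexDirCheck {

    // True if any of the given file names starts with "segments",
    // the file family the IndexNotFoundException reports as missing.
    static boolean hasSegmentsFile(String[] fileNames) {
        if (fileNames == null) {
            return false;
        }
        for (String name : fileNames) {
            if (name.startsWith("segments")) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that lists the directory first.
    // File.list() returns null when the path is not a readable directory.
    static boolean hasSegmentsFile(File indexDir) {
        return hasSegmentsFile(indexDir.list());
    }
}
```

For the directory listing shown in the error message (write.lock, _0.fdt, _0.fdx) this check returns false, which is consistent with a writer that was never closed.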