The program that builds the index is as follows:
- Java code
public class testIndexer {
    private IndexWriter writer;

    public testIndexer(String indexDir) throws CorruptIndexException, LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new IKAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws CorruptIndexException, IOException {
        writer.close();
    }

    public void indexPDFile(String filename) throws Exception {
        File file = new File(filename);
        String content = PdfExtractor.getText(file);
        Document doc = new Document();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    public static void main(String args[]) {
        String path = "k:/aaaa";
        String pdfile = "k:/kaks.pdf";
        try {
            testIndexer indx = new testIndexer(path);
            indx.indexPDFile(pdfile);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The search program is as follows:
- Java code
public class testSearch {
    public static void main(String args[]) throws IOException, ParseException {
        String indexDir = "K:/aaaa";
        String q = "分子";
        search(indexDir, q);
    }

    public static void search(String indexDir, String q) throws IOException, ParseException {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_35, "content", new IKAnalyzer());
        Query query = parser.parse(q);
        TopDocs hits = is.search(query, 10);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("content"));
        }
        is.close();
    }
}
After indexing, the aaaa folder contains only three files: _0.fdt, _0.fdx and write.lock.
Running the search then fails with the following error:
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@K:\aaaa lockFactory=org.apache.lucene.store.NativeFSLockFactory@ec16a4: files: [write.lock, _0.fdt, _0.fdx]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:322)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:110)
at com.index_search.testSearch.search(testSearch.java:24)
at com.index_search.testSearch.main(testSearch.java:19)
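How can I build an index for a PDF and then search for keywords in it?
------Solution--------------------------------------------------------
You can parse it with Tika.
------Solution--------------------------------------------------------
You need to use the Tika parser to extract the text.
------Solution--------------------------------------------------------
You probably need to build a table that maps keywords to document locations and index by keyword. If you index the file directly, you have to read the PDF's content out first. But what if the PDF consists entirely of images? Then there is nothing to extract. So in general, an index is built from a dedicated mapping table rather than by reading the files directly.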
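------Solution--------------------------------------------------------
The index is only usable after the writer has been closed. IndexWriter writes the segments file when it commits or closes, and the main method of the indexing program never calls indx.close(), which is why the folder holds only _0.fdt, _0.fdx and write.lock. Call indx.close() after indx.indexPDFile(pdfile), preferably in a finally block.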
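To show why the missing close() produces exactly this symptom, here is a minimal sketch using only the Java standard library. ToyIndexWriter, indexExists and the segments_1 file name are hypothetical stand-ins made up for illustration; this is a toy model of the commit-on-close behaviour, not Lucene's actual implementation or API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy model of commit-on-close; NOT Lucene's implementation. */
public class CommitOnCloseDemo {

    /** Hypothetical writer: data files appear on add, the segments file only on close. */
    static class ToyIndexWriter implements Closeable {
        private final Path dir;

        ToyIndexWriter(Path dir) throws IOException {
            this.dir = Files.createDirectories(dir);
        }

        void addDocument(String content) throws IOException {
            // Stored data is written right away, like _0.fdt in the question.
            Files.write(dir.resolve("_0.fdt"), content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            // Only now is the commit point (the segments file) published.
            Files.write(dir.resolve("segments_1"), new byte[0]);
        }
    }

    /** A reader can open the directory only if some segments* file exists. */
    static boolean indexExists(Path dir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "segments*")) {
            return files.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        ToyIndexWriter writer = new ToyIndexWriter(dir);
        try {
            writer.addDocument("molecule");
            System.out.println("before close: " + indexExists(dir)); // prints "before close: false"
        } finally {
            writer.close(); // the step missing from the question's main()
        }
        System.out.println("after close: " + indexExists(dir)); // prints "after close: true"
    }
}
```

The try/finally shape is the part that carries over to the real program: put indx.indexPDFile(pdfile) in the try block and indx.close() in the finally block, so the index is committed even if text extraction throws.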