
Indexing and searching Chinese files with Lucene 3.0.0

Editor: .NET實例教程

1. My original program

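My original program was actually quite simple: it was adapted entirely from the SearchFiles and IndexFiles classes in the Lucene demo. The only difference is that it uses the SmartCN analyzer (SmartChineseAnalyzer).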

Here is the code for the parts I changed.

IndexChinese.java:

Date start = new Date();
try {
  // create a fresh index (create = true) using the SmartCN analyzer
  IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
          new SmartChineseAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  indexDocs(writer, docDir);
  System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
  System.out.println("Optimizing...");
  //writer.optimize();
  writer.close();

  Date end = new Date();
  System.out.println(end.getTime() - start.getTime() + " total milliseconds");

} catch (IOException e) {
  // report indexing failures
  System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
}

SearchChinese.java:

// the analyzer used for queries must match the one used for indexing
Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);

BufferedReader in = null;
if (queries != null) {
  // queries come from a file
  in = new BufferedReader(new FileReader(queries));
} else {
  // queries come from the console, decoded as GBK
  in = new BufferedReader(new InputStreamReader(System.in, "GBK"));
}

Here I specified that queries typed at the console are GBK-encoded.

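Then I ran it, full of confidence... and found that Chinese text could not be retrieved at all, while searches for the English text in the same files worked fine.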
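The query-reading branch can be tried in isolation, without Lucene. The following is only a minimal sketch using the JDK: the class `QueryInputDemo`, the helper `readQuery`, and the sample query are made up for illustration, and a byte array stands in for a console that delivers GBK bytes on System.in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class QueryInputDemo {
    // Hypothetical helper: read one query line, decoding the bytes with an explicit charset.
    public static String readQuery(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            String line = reader.readLine();
            return line == null ? null : line.trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String typed = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // Assumption: the console sends GBK bytes; simulate that with a byte array.
        InputStream fakeConsole = new ByteArrayInputStream((typed + "\n").getBytes(gbk));

        String query = readQuery(fakeConsole, gbk);
        System.out.println("query decoded correctly: " + typed.equals(query));
    }
}
```

If the decoded query matches what was typed, the query side is not the place to look for the problem.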

2. Finding the problem

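That left me frustrated. I am not very familiar with either Java or Lucene, and there is not much discussion of version 3.0.0 out there, so I fumbled around for a while. Eventually I found that if I re-saved the files as ANSI (they had been UTF-8), Chinese searches worked. So it looked like a file-encoding problem. After some digging, I found the following code in IndexChinese.java: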
static void indexDocs(IndexWriter writer, File file)
  throws IOException {
  // do not try to index files that cannot be read
  if (file.canRead()) {
    if (file.isDirectory()) {
      String[] files = file.list();
      // an IO error could occur
      if (files != null) {
        for (int i = 0; i < files.length; i++) {
          indexDocs(writer, new File(file, files[i]));
        }
      }
    } else {
      System.out.println("adding " + file);
      try {
        writer.addDocument(FileDocument.Document(file));
      }
      // at least on Windows, some temporary files raise this exception with an "Access denied" message
      // checking if the file can be read doesn't help
      catch (FileNotFoundException fnfe) {
        ;
      }
    }
  }
}

The key is this statement:

writer.addDocument(FileDocument.Document(file));

The code that actually reads the file should be inside this call, so I stepped into it:

public class FileDocument {
  public static Document Document(File f)
       throws java.io.FileNotFoundException, UnsupportedEncodingException {
    Document doc = new Document();

    // the file path: stored, indexed as a single token
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

    // the last-modified time, at minute resolution
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.NOT_ANALYZED));

    // the file contents, read through a FileReader (no encoding specified)
    doc.add(new Field("contents", new FileReader(f)));

    // return the document
    return doc;
  }

  private FileDocument() {}
}

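This is a helper class from the Lucene demo. Its job is to take the content of a text file and build a Document with three fields by default: path, modified, and contents, where contents is the text of the file. The culprit looks like FileReader(f): nothing there specifies which encoding to read with, so the file is decoded with the platform default encoding. On my machine that is presumably the ANSI code page, which would explain why the ANSI copies worked and the UTF-8 ones did not. So I made a simple change here: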

FileInputStream fis = new FileInputStream(f);
// decode the byte stream into a character stream using the named charset
InputStreamReader isr = new InputStreamReader(fis, "UNICODE");
// buffer the character stream
BufferedReader br = new BufferedReader(isr);

doc.add(new Field("contents", br));

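The "UNICODE" string can be replaced with the name of any supported encoding (as far as I can tell, Java treats "Unicode" as an alias for UTF-16, which does not match a UTF-8 file). Once I changed it to "utf-8", searching worked normally.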
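The effect of the charset argument can be checked without Lucene. This is a minimal sketch using only the JDK: the class `ExplicitCharsetReadDemo`, the helpers `writeTemp` and `readAll`, and the sample text are hypothetical names chosen for illustration. It writes a UTF-8 file and reads it back through the same FileInputStream / InputStreamReader / BufferedReader chain, once with the matching charset and once with GBK.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ExplicitCharsetReadDemo {
    // Hypothetical helper: write text to a temporary file in the given encoding.
    public static File writeTemp(String text, Charset charset) {
        try {
            File f = File.createTempFile("lucene-doc", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), text.getBytes(charset));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole file through the reader chain, with the charset passed in.
    public static String readAll(File f, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charset))) {
            int c;
            while ((c = br.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Lucene" followed by four Chinese characters, written as escapes
        String text = "Lucene \u4e2d\u6587\u68c0\u7d22";
        File utf8File = writeTemp(text, StandardCharsets.UTF_8);

        // matching charset: the text comes back unchanged
        System.out.println(text.equals(readAll(utf8File, StandardCharsets.UTF_8)));
        // mismatched charset: the Chinese part is garbled, so the strings differ
        System.out.println(text.equals(readAll(utf8File, Charset.forName("GBK"))));
    }
}
```

With a mismatched charset the ASCII part still decodes correctly, which is consistent with English searches working while Chinese ones failed.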

3. Some guesses

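My guess is that when Lucene indexes files, the source encoding itself does not matter: as long as the encoding is specified correctly when reading, the resulting index can be searched normally. In other words, the same text stored in different encodings should produce the same index (still to be verified).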
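Part of this guess can be checked with the JDK alone. The sketch below does not run Lucene, so it says nothing about the index files themselves; it only checks the input an analyzer would receive through a Reader. The class name `SameTextDifferentEncodingDemo`, the helper `decode`, and the sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameTextDifferentEncodingDemo {
    // Decode bytes through a Reader, the way file contents are handed to a "contents" field.
    public static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String text = "\u4e2d\u6587\u68c0\u7d22"; // four Chinese characters, as escapes

        // The same text as it would be stored on disk in two encodings.
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asGbk = text.getBytes(gbk);

        System.out.println("bytes identical: " + Arrays.equals(asUtf8, asGbk));
        System.out.println("decoded text identical: "
                + decode(asUtf8, StandardCharsets.UTF_8).equals(decode(asGbk, gbk)));
    }
}
```

The bytes on disk differ, but once each file is decoded with its own charset the character streams are the same, so an analyzer reading from those Readers would be given identical input.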