[PHP] Implementing Chinese Word Segmentation with Zend_Search_Lucene - danielhuang030's Research Log

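Recently I needed a search feature in a practice project. The usual way to search a MySQL database is to target specific columns of a table with LIKE and the "%" wildcard. This approach comes with many limitations, though: the user has to pick which field to search before typing anything. Having gotten used to the convenience of Google, I think the ideal is a single input box where you can type anything and search the whole database. In MySQL this is called Full-Text search; yet after consulting the almighty Google, I found that the veterans online almost unanimously reject MySQL full-text search. The main reason is that it does not support Chinese!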

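Full-text search works by building a tokenized ("word-segmented") index over the data in the database; with an index, searching is naturally far more efficient. Chinese differs from English, however: a single Chinese character within a sentence can carry meaning on its own, and, more importantly, Chinese sentences are not made of words separated by spaces the way English sentences are. And the tokenization step of index building decides word boundaries by exactly those spaces!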
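
To make the problem concrete, here is a minimal sketch of a whitespace-only tokenizer. The function name `whitespaceTokens` is my own illustration and not part of MySQL or Zend Framework, and the type hints assume PHP 7 or later. It shows that an English phrase splits into words while a Chinese phrase stays as one unsearchable lump.

```php
<?php
// Hypothetical helper: split text on runs of whitespace only,
// the way a space-based tokenizer would.
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$english = whitespaceTokens('full text search');
$chinese = whitespaceTokens('全文檢索的功能');

echo count($english), "\n"; // 3 tokens: full, text, search
echo count($chinese), "\n"; // 1 token: the whole phrase, unsplit
```

A search for 檢索 can never match a token that is the entire phrase, which is why the tokenizer has to be replaced for Chinese.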

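The full-text search problem has been around online for a long time, but nobody seems to have a perfect answer; someone even talked the asker out of it outright: "Full-text search is a topic you could write several doctoral dissertations on!" Which shows just how powerful Google's cloud computing is. Since there was no solution on the database side, I approached it from PHP instead. Google really is almighty: it led me to Zend Search Lucene, the Zend Framework search component with tokenization built in. It does not support Chinese out of the box either, but Google also found me a workaround that does!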

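In "PHP製作中文全文搜尋不求人" (a do-it-yourself guide to Chinese full-text search in PHP), the author explains the folder layout and source code in great detail, and even thoughtfully provides the sample code for download. In "如何讓Zend_Search_Lucene支持中文分詞" (how to make Zend_Search_Lucene support Chinese word segmentation), the author improves the tokenizer class so that Chinese segmentation is more accurate. With examples this complete, of course I applied them to my practice project right away!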

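The framework I am practicing with is Wacow Framework, built single-handedly by Jace. It combines Zend Framework with Smarty plus many tools that often come up in project work; it is the mainstay of the company's current project development and the tool I need to get fluent in. I put the index building on the homepage, so that anyone visiting the homepage triggers it (the price, of course, is that every homepage load is slowed down). Thanks to class autoloading, no extra include or require is needed when building the index. The code is as follows: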

IndexController.php (partial excerpt)

// Build the tokenized index
// Suppress Notice-level errors
error_reporting(E_ALL ^ E_NOTICE);
// This line is the key point if your data is UTF-8 encoded: you must add it, or the data will come out wrong.
// Phpbean is the Chinese tokenizer class, which has to be created separately.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Phpbean());
if (function_exists("set_time_limit") && ! get_cfg_var('safe_mode')) {
    set_time_limit(0);
}
$index = new Zend_Search_Lucene('index', true); // true = create a fresh index
$itemTable = new Items();
$itemRowset = $itemTable->fetchAll();
foreach ($itemRowset as $itemRow) {
    $url = '/gime/item/detail/id/' . $itemRow->id; // build the link
    $itemName = $itemRow->name; // item name
    $description = $itemRow->description; // item description
    $doc = new Zend_Search_Lucene_Document(); // create a new index document
    // Store the page location so search results can link to it (stored, not indexed)
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', strtolower($url)));
    $doc->addField(Zend_Search_Lucene_Field::Text('name', strtolower($itemName), 'utf-8'));
    $doc->addField(Zend_Search_Lucene_Field::Text('contents', strtolower($description), 'utf-8'));
    $index->addDocument($doc); // add the document to the index
}
$index->commit(); // commit, i.e. save the index
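
Since rebuilding on every homepage visit is costly, one option is to skip the rebuild unless the index is older than some threshold. The sketch below is only an illustration under my own assumptions: `indexIsStale`, the marker file path, and the one-hour threshold are all hypothetical and not part of Zend Framework or Wacow Framework, and the nullable type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: decide whether the index is due for a rebuild.
// $markerFile is a file that gets touched after each successful rebuild.
function indexIsStale(string $markerFile, int $maxAgeSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    if (!is_file($markerFile)) {
        return true; // no marker yet, so the index was never built
    }
    return ($now - filemtime($markerFile)) >= $maxAgeSeconds;
}

$marker = sys_get_temp_dir() . '/lucene_index.built'; // assumed location
if (indexIsStale($marker, 3600)) {
    // ... run the index-building loop here ...
    touch($marker); // record the time of this rebuild
}
```

The same check could just as well live in a script run by a scheduler instead of in the homepage controller.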

Phpbean.php

class Phpbean extends Zend_Search_Lucene_Analysis_Analyzer_Common
{
    // Current byte offset into the input string
    private $_position;

    // Extra Chinese stop words to strip before tokenizing
    private $_cnStopWords = array();

    public function setCnStopWords($cnStopWords)
    {
        $this->_cnStopWords = $cnStopWords;
    }

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;
        // Punctuation (ASCII and full-width) plus the particle "的" are replaced with spaces
        $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "'", "$", "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", "：", "）", "（", "．", "。", "，", "！", "；", "“", "”", "‘", "’", "［", "］", "、", "—", "　", "《", "》", "－", "…", "【", "】", "的");
        $this->_input = str_replace($search, ' ', $this->_input);
        $this->_input = str_replace($this->_cnStopWords, ' ', $this->_input);
    }

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null) {
            return null;
        }
        $len = strlen($this->_input);
        while ($this->_position < $len) {
            // Skip whitespace
            while ($this->_position < $len && $this->_input[$this->_position] == ' ') {
                $this->_position++;
            }
            // Stop if only trailing whitespace was left
            if ($this->_position >= $len) {
                break;
            }
            $termStartPosition = $this->_position;
            $temp_char = $this->_input[$this->_position];
            $isCnWord = false;
            if (ord($temp_char) > 127) {
                // Non-ASCII byte: take up to two characters, assuming 3 bytes each (UTF-8 CJK)
                $i = 0;
                while ($this->_position < $len && ord($this->_input[$this->_position]) > 127) {
                    $this->_position = $this->_position + 3;
                    $i++;
                    if ($i == 2) {
                        $isCnWord = true;
                        break;
                    }
                }
                // A lone Chinese character is not emitted as a token
                if ($i == 1) continue;
            } else {
                // ASCII: consume a run of letters and digits
                while ($this->_position < $len && ctype_alnum($this->_input[$this->_position])) {
                    $this->_position++;
                }
            }
            // Nothing consumed (unhandled byte): step over it
            if ($this->_position == $termStartPosition) {
                $this->_position++;
                continue;
            }

            $token = new Zend_Search_Lucene_Analysis_Token(
                substr($this->_input,
                       $termStartPosition,
                       $this->_position - $termStartPosition),
                $termStartPosition,
                $this->_position);
            $token = $this->normalize($token);
            // Step back one character so consecutive two-character tokens overlap
            if ($isCnWord) $this->_position = $this->_position - 3;
            if ($token !== null) {
                return $token;
            }
        }
        return null;
    }
}

SearchController.php (partial excerpt)

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Phpbean());
$index = new Zend_Search_Lucene('index');
$query = $this->_request->getParam('query'); // keyword submitted from the form
$query = trim($query);
$hits = array(); // stays empty when no keyword is given
if (strlen($query) > 0) {
    try {
        $query2 = Zend_Search_Lucene_Search_QueryParser::parse(strtolower($query), "utf-8");
        $hits = $index->find($query2); // look up matches for the keyword; each hit comes back as an object
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }
}
$numHits = count($hits); // number of matches found for the keyword

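A very simple implementation; still, as full-text search goes it has drawbacks! First, building the tokenized index inevitably consumes system resources, so a better approach is to build the index in scheduled batches. Second is a Chinese-specific issue: because of how Chinese words relate to continuous sentences, tokenization uses two characters as the smallest unit of a word, so searching for a single Chinese character returns nothing. Finally, because index building is event-triggered, the index is not brought up to the current state of the database unless something triggers it. To me a tokenized index feels a lot like a MySQL View: it is also a derived refresh of the data in the tables, except that it is not limited by columns and you search against the indexed "terms". In some respects it is a very handy feature, and a workable solution for Chinese full-text indexing. ^^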
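
The two-character unit mentioned above can be illustrated with a standalone sketch. `cjkBigrams` is a hypothetical helper of my own, not the Phpbean class itself; it splits by character rather than by byte offset and assumes PHP 7 or later. It only shows which overlapping two-character tokens come out of a run of Chinese text, and why a single character yields none.

```php
<?php
// Hypothetical helper: split a run of CJK text into overlapping
// two-character tokens (bigrams).
function cjkBigrams(string $text): array
{
    // '//u' splits a UTF-8 string into individual characters
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = [];
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $tokens[] = $chars[$i] . $chars[$i + 1];
    }
    return $tokens;
}

print_r(cjkBigrams('全文檢索')); // 全文, 文檢, 檢索
print_r(cjkBigrams('詞'));       // empty array: no token for a lone character
```

Because only pairs end up in the index, a one-character query has nothing to match against.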