
Site Search Based on Google Search, with Custom Regex Parsing in C#


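The requirement: the site has a search feature that needs to query an external search engine. I used Google Search for this.

Custom search address: http://www.google.com/custom? Use the .com address; the .cn results page is mixed with ads, which makes it harder to parse.

The search query is the site plus the keyword. Anyone who has done SEO knows what site: means: every result is a page of your site that Google has indexed and that contains the keyword.

The search URL: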

http://www.google.com/custom?hl=en&newwindow=1&q=site:www.hx-soft.cn++檔案數字化&btnG=Google+搜索

These are the default parameters; there are of course others.

For details on the other parameters, see:

http://blog.csdn.net/hean/archive/2008/03/03/2142689.aspx
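
The query in the URL above contains a colon, spaces and non-ASCII text, so it is safer to percent-encode it than to concatenate it by hand. The following is a minimal sketch, not code from the original post: the class SearchUrlSketch and the method BuildSearchUrl are hypothetical names, and only the hl, newwindow and q parameters shown above are used.

```csharp
using System;

public static class SearchUrlSketch
{
    // Hypothetical helper; not part of the original SearchByGoogle class.
    public static string BuildSearchUrl(string site, string keyword)
    {
        // "site:<host> <keyword>" restricts results to pages of one site.
        string query = "site:" + site + " " + keyword;

        // EscapeDataString percent-encodes the colon, the space and any
        // non-ASCII characters (as UTF-8 bytes).
        return "http://www.google.com/custom?hl=en&newwindow=1"
             + "&q=" + Uri.EscapeDataString(query);
    }
}
```

For example, BuildSearchUrl("www.hx-soft.cn", "test") yields a URL whose q parameter is site%3Awww.hx-soft.cn%20test.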

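From the results page I only need two parts:

The first part gives the total number of results (42 in this example).

The second part is the main list of search results. That is all I need. The rough steps: first download the HTML source of the results page, then use regular expressions to extract the parts I want.

First, create a static class SearchByGoogle and add this method: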

/// <summary>
/// Downloads the HTML source of a remote page.
/// </summary>
/// <param name="url">The search URL.</param>
/// <returns>The downloaded page as a string.</returns>
public static string GetSearchHtml(string url)
{
    // Requires: using System.Net; using System.Text;
    using (WebClient MyWebClient = new WebClient())                     // dispose the client when done
    {
        MyWebClient.Credentials = CredentialCache.DefaultCredentials;   // credentials used to authenticate the request
        byte[] pageData = MyWebClient.DownloadData(url);                // download the raw bytes from the URL
        return Encoding.UTF8.GetString(pageData);                       // the page is encoded as UTF-8
    }
}

Once this method has fetched the HTML for a URL, parsing can begin. Add a method that gets the total number of results:

/// <summary>
/// Gets the total number of search results.
/// </summary>
/// <param name="pageHtml">HTML returned by GetSearchHtml.</param>
/// <returns>The result count, or 0 if no count was found.</returns>
public static int IsExistResult(string pageHtml)
{
    // Requires: using System.Text.RegularExpressions;
    int count = 0;                                                  // result count
    Regex reg = new Regex(@"of(?:\sabout)? <b>(\d+)</b> from");     // matches "of [about] <b>N</b> from"

    Match m = reg.Match(pageHtml);
    if (m.Success)
    {
        // Group 1 holds the digits captured by (\d+).
        count = int.Parse(m.Groups[1].Value);
    }
    return count;
}
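
To see what the counting pattern is expected to match, here is a small self-contained sketch. It assumes the intended pattern is of(?:\sabout)? <b>(\d+)</b> from (with \s and \d escapes), and the sample strings in the usage note are made-up approximations of the results header, not captured Google output. The class CountRegexSketch and the method ParseCount are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountRegexSketch
{
    // Assumed pattern: "of <b>N</b> from" or "of about <b>N</b> from".
    private static readonly Regex CountPattern =
        new Regex(@"of(?:\sabout)? <b>(\d+)</b> from");

    // Hypothetical helper: returns the captured count, or 0 when
    // the pattern does not match or the number does not fit an int.
    public static int ParseCount(string pageHtml)
    {
        int count = 0;
        Match m = CountPattern.Match(pageHtml);
        if (m.Success)
        {
            int.TryParse(m.Groups[1].Value, out count);
        }
        return count;
    }
}
```

With these assumptions, ParseCount("of about <b>42</b> from") returns 42, ParseCount("of <b>7</b> from") returns 7, and a page without the header returns 0.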

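The results are also paginated, 10 per page. The method above tells us whether anything was found, and the total in count can be used to build the paging parameters.

Looking at how the URL handles paging: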

http://www.google.com/custom?hl=en&newwindow=1&q=site:www.hx-soft.cn++檔案數字化&start=10&sa=N

This is the second page, which has the parameter start=10. The first page is start=0, the third page is start=20, and so on.

Add the following method:

/// <summary>
/// Gets the start offset of each page of search results.
/// </summary>
/// <param name="count">Total number of results.</param>
/// <returns>An array holding the start offset of each page.</returns>
public static int[] GetPageStarts(int count)
{
    // number of pages, rounding up (10 results per page)
    int pageTotal = count % 10 == 0 ? count / 10 : (count / 10) + 1;
    // start offset of each page
    int[] starts = new int[pageTotal];

    for (int i = 0; i < pageTotal; i++)
    {
        starts[i] = i * 10;     // 0, 10, 20, ...
    }
    return starts;
}

This builds an array of start parameters for the pages, ready for adding LinkButton controls to your own results page.
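
As a rough illustration of how those start offsets could turn into page links, the sketch below appends a start parameter to a first-page URL. The class PagingSketch and the method BuildPageUrls are hypothetical names, not from the original post, and the sa=N parameter seen in the sample paging URL is left out on the assumption that start alone selects the page.

```csharp
using System;

public static class PagingSketch
{
    // Hypothetical helper: one URL per result page, 10 results per page.
    // firstPageUrl is assumed to already contain a query string.
    public static string[] BuildPageUrls(string firstPageUrl, int count)
    {
        int pageTotal = (count + 9) / 10;   // round up for count >= 0
        string[] urls = new string[pageTotal];

        for (int i = 0; i < pageTotal; i++)
        {
            // Page i starts at result i * 10: 0, 10, 20, ...
            urls[i] = firstPageUrl + "&start=" + (i * 10);
        }
        return urls;
    }
}
```

For 42 results this produces five URLs, the last one ending in &start=40; each could serve as the target of one paging link.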

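The last step is to parse out the list in the middle of the page. Fortunately, the results page does not use linked stylesheets, so I simply reuse the styles in its head; only the middle block needs to be parsed out.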