
Building a Tool in C# That Automatically Fills In Baidu Space CAPTCHAs

Editor: C# Basics

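Baidu Tieba's CAPTCHA is loaded through JavaScript. It looks like some kind of Ajax, though I have not fully worked out the details.
When the cursor lands in the reply edit box, an onfocus event fires, and its script makes the CAPTCHA input box visible by changing its display style. When the cursor then moves into the CAPTCHA input box, another onfocus event fires and the script displays the CAPTCHA image. Once you have seen this, the overall process is basically clear. To recognize and fill in the CAPTCHA automatically, we first have to get hold of the image that shows it, and before we can get the image it has to be displayed, so we must simulate the steps above. We do not need to handle the focus event of the reply edit box, because anyone replying has to type something there anyway, but the step after it has to be simulated. Because we access the IE browser from outside the process, we use mshtml.dll.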
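First, add a reference to mshtml.dll in your project. The SHDocVw types used in the code below come from the "Microsoft Internet Controls" COM library, which needs a reference as well.
using System.Collections.Generic; // List<T>
using System.Drawing;             // Bitmap, Image, Rectangle
using System.IO;                  // Path
using System.Windows.Forms;       // Clipboard
using mshtml;
Add the following attribute before the class to make it visible to COM:
[System.Runtime.InteropServices.ComVisible(true)]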

Next we define a method, getImage, that fetches the picture. getImage uses the IHTML interfaces of mshtml to reach the HTML elements inside IE.
        private Bitmap getImage()
        {
            SHDocVw.ShellWindows shellWindows = new SHDocVw.ShellWindowsClass();
            Bitmap Img = null;
            // Walk the open shell windows and pick out the ones that belong to the browser
            foreach (SHDocVw.InternetExplorer ie in shellWindows)
            {
                string filename = Path.GetFileNameWithoutExtension(ie.FullName).ToLower();
                if (!filename.Equals("iexplore")) continue; // only windows hosted by iexplore.exe

                mshtml.IHTMLDocument2 Doc = ie.Document as mshtml.IHTMLDocument2;
                if (Doc == null || Doc.domain != "baidu.com") continue;

                HTMLBody body = (HTMLBody)Doc.body;
                mshtml.IHTMLControlRange range = (IHTMLControlRange)body.createControlRange();
                mshtml.IHTMLElementCollection all = Doc.all;

                mshtml.IHTMLElement element = all.item("captcha", null) as mshtml.IHTMLElement;
                if (element == null) continue; // this page has no CAPTCHA input box
                element.click(); // simulate a mouse click on the CAPTCHA input box
                System.Threading.Thread.Sleep(1000); // pause 1 second to let IE respond

                mshtml.IHTMLControlElement item = all.item("captcha_img", null) as mshtml.IHTMLControlElement;
                if (item == null) continue; // the CAPTCHA image did not appear
                System.Threading.Thread.Sleep(1000); // wait for IE to load the CAPTCHA image

                range.add(item);
                range.execCommand("Copy", false, null);
                Image copied = Clipboard.GetImage(); // fetch the CAPTCHA image from the clipboard; null if the copy failed
                if (copied != null) Img = new Bitmap(copied);
                Clipboard.Clear();
            }
            return Img;
        }
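getImage depends on the clipboard, and the Windows Forms Clipboard class only works on a thread running in single-threaded apartment (STA) mode. If the method is ever called from a worker thread rather than the UI thread, something like the following sketch could be used. StaRunner is a hypothetical helper introduced here for illustration; it is not part of the original program.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a function on a dedicated STA thread and returns its result.
static class StaRunner
{
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; } // carry the failure back to the caller
        });
        thread.SetApartmentState(ApartmentState.STA); // must be set before Start()
        thread.Start();
        thread.Join(); // block until the work has finished

        if (error != null)
            throw new InvalidOperationException("Work on the STA thread failed.", error);
        return result;
    }
}
```

With this in place, a call such as `Bitmap img = StaRunner.Run(() => getImage());` would keep the clipboard access on an STA thread.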
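We now have an image. The next step is to recognize it, and for that we use Tesseract, an open-source library backed by Google. First, a short introduction to Tesseract.
The OCR engine named Tesseract was first developed at HP Labs starting in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. Not long afterwards, however, HP decided to leave the OCR business, and Tesseract was shelved.
Years later HP realized that, rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005 Tesseract was obtained by the Information Science Research Institute at the University of Nevada, Las Vegas, which asked Google to improve it, fix its bugs, and optimize it.
After fixing several of the most important bugs, Google considered Tesseract OCR stable enough to be released again as open-source software.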
http://sourceforge.net/projects/tesseract-ocr
We use the .NET version of Tesseract, tessnet2, which can be downloaded from http://www.pixel-technology.com/freeware/tessnet2/
Next, let's write the method that recognizes the CAPTCHA.
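        private void Ocr()
        {
            // Required: the language data files are stored in this directory
            string datapath = System.Environment.CurrentDirectory + "\\tessdata";
            Bitmap image = getImage();
            if (image == null) return; // no CAPTCHA image was captured
            tessnet2.Tesseract ocr = new tessnet2.Tesseract();
            ocr.Init(datapath, "eng", false);
            ocr.OcrDone = new tessnet2.Tesseract.OcrDoneHandler(Done); // Done takes over once recognition has finished
            ocr.DoOCR(image, Rectangle.Empty);
        }

        void Done(List<tessnet2.Word> Words)
        {
            // str is a class-level string field; write() reads it later
            if (Words != null && Words.Count > 0)
                str = Words[0].Text;
        }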
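The handler above keeps only the first recognized word. If the engine splits the CAPTCHA into several fragments or slips in stray punctuation, the result could be tidied up before it is typed into the page. The following is a minimal sketch under the assumption that the CAPTCHA consists only of ASCII letters and digits. CaptchaText.Clean is a hypothetical helper, and it takes plain strings (for example the Text of each word) so that it does not depend on tessnet2.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: joins recognized fragments and drops everything
// that is not an ASCII letter or digit.
static class CaptchaText
{
    public static string Clean(IEnumerable<string> fragments)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string fragment in fragments)
        {
            if (fragment == null) continue; // ignore missing fragments
            foreach (char ch in fragment)
            {
                bool isDigit = ch >= '0' && ch <= '9';
                bool isLetter = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
                if (isDigit || isLetter) sb.Append(ch);
            }
        }
        return sb.ToString();
    }
}
```

For example, `CaptchaText.Clean(new[] { "a B", "3-d" })` returns `"aB3d"`.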
The CAPTCHA has been recognized. Next we put it into the CAPTCHA input box.
        public void write()
        {
            SHDocVw.ShellWindows shellWindows = new SHDocVw.ShellWindowsClass();
            foreach (SHDocVw.InternetExplorer ie in shellWindows)
            {
                string filename = Path.GetFileNameWithoutExtension(ie.FullName).ToLower();
                if (!filename.Equals("iexplore")) continue; // only windows hosted by iexplore.exe

                mshtml.IHTMLDocument2 Doc = ie.Document as mshtml.IHTMLDocument2;
                if (Doc == null || Doc.domain != "baidu.com") continue;

                mshtml.IHTMLElement element = Doc.all.item("captcha", null) as mshtml.IHTMLElement;
                if (element != null)
                    element.setAttribute("value", str, 0); // put the recognized text into the CAPTCHA input box
            }
        }
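Both getImage and write compare Doc.domain with "baidu.com" exactly. A page that does not set document.domain itself reports its full host name, such as tieba.baidu.com, and the exact comparison would then skip it. A more tolerant check might look like the sketch below. BaiduHost.IsBaidu is a hypothetical helper, not something the original code uses.

```csharp
using System;

// Hypothetical helper: accepts "baidu.com" itself and any of its subdomains.
static class BaiduHost
{
    public static bool IsBaidu(string domain)
    {
        if (string.IsNullOrEmpty(domain)) return false;
        string d = domain.Trim().ToLowerInvariant();
        // The leading dot keeps look-alike hosts such as "notbaidu.com" out.
        return d == "baidu.com" || d.EndsWith(".baidu.com", StringComparison.Ordinal);
    }
}
```

The condition in both methods could then read `BaiduHost.IsBaidu(Doc.domain)` instead of the string comparison.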
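The program is more or less finished. In my tests the recognition rate for Baidu's CAPTCHAs was about 40%, which means more than half of them still go unrecognized. There are two possible reasons. First, I may not fully understand how to use Tesseract. Second, adjacent characters in Baidu's CAPTCHAs sometimes overlap, and even the human eye has to look carefully to tell them apart. In my earlier experiments with Tesseract on English letters and digits, the recognition rate on standard text reached 100%. That is all for now. If anyone is familiar with Tesseract, feel free to help me improve this rough program. This was also my first time using the IHTML interfaces, so that part is rather clumsy.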
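One common thing to try when OCR accuracy is low is to reduce the image to pure black and white before passing it to the engine. The sketch below is not part of the original program, and whether it helps with Baidu's overlapping characters is untested. CaptchaPreprocessor.Binarize and the threshold value are assumptions chosen for illustration.

```csharp
using System.Drawing;

// Hypothetical helper: returns a black-and-white copy of the image.
static class CaptchaPreprocessor
{
    // Pixels whose gray level is below the threshold become black, the rest white.
    public static Bitmap Binarize(Bitmap source, int threshold)
    {
        Bitmap result = new Bitmap(source.Width, source.Height);
        for (int y = 0; y < source.Height; y++)
        {
            for (int x = 0; x < source.Width; x++)
            {
                Color c = source.GetPixel(x, y);
                // Integer approximation of perceived luminance
                int gray = (c.R * 299 + c.G * 587 + c.B * 114) / 1000;
                result.SetPixel(x, y, gray < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }
}
```

A call such as `CaptchaPreprocessor.Binarize(getImage(), 128)` would produce the image to hand to the OCR step. GetPixel and SetPixel are slow, but a CAPTCHA image is small enough that this should not matter.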