Implementing a Web-Based RSS Reader in Java (3): Parsing Online RSS Subscriptions

Editor: About JAVA

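The previous post, "Web-Based RSS Reader (2): Loading the RSS Subscription Group List with a dTree Tree", covered reading the list of RSS subscriptions. This post takes the next step: once we have fetched an online RSS feed, how do we parse it to get at the articles or news items?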

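First, a word about RSS versions. Many people talk about RSS, but quite a few of them do not realize that there is more than one format. The two feed formats in common use are RSS and Atom. RSS runs from version 0.9 up to the current 2.0, while the latest version of Atom is 1.0.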

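A developerWorks article, "Implementing News Syndication with RSS and Atom", describes how the two formats are alike and how they differ:

Similarities between RSS and Atom feeds

Each feed file represents one channel. It contains the channel title, link, description, author, and so on. This channel information gives the basic facts about the feed. After the channel information come a number of items. Each item represents an actual news story or article that can be read in a feed reader. An item normally contains a title, a link, an update time, and a summary.

Differences between RSS and Atom feeds

See "RSS 2.0 and Atom 1.0, Compared" for a review of the differences between RSS and Atom.

RSS and Atom have similar XML-based formats. Their basic structure is the same; they differ only slightly in how individual nodes are expressed.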

The standard RSS format:

<?xml version="1.0"?>
<!-- XML declaration (version and, optionally, character set) -->
<!-- RSS version -->
<rss version="2.0">
  <!-- Channel information and the list of news items or articles follow -->
  <channel>
    <!-- Channel-level information: start -->
    <!-- Channel title -->
    <title>Lift Off News</title>
    <!-- Main URL of the channel -->
    <link>http://liftoff.msfc.nasa.gov/</link>
    <!-- Channel description -->
    <description>Liftoff to Space Exploration.</description>
    <!-- Channel language (zh-cn, for example, means Simplified Chinese) -->
    <language>en-us</language>
    <!-- Time the channel was published -->
    <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
    <!-- Time the channel was last updated -->
    <lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <!-- Channel generator -->
    <generator>Weblog Editor 2.0</generator>
    <ttl>5</ttl>
    <!-- Channel-level information: end -->
    <!-- Every RSS news entry is contained in an item node -->
    <item>
      <!-- News title -->
      <title>Star City</title>
      <!-- News link -->
      <link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
      <!-- Short description of the news content -->
      <description>How do Americans get ready to work with Russians aboard the
      International Space Station? They take a crash course in culture, language
      and protocol at Russia's Star City.</description>
      <!-- News publication time -->
      <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
      <!-- News category -->
      <category>IT</category>
      <!-- News author -->
      <author>bill</author>
      <guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid>
    </item>
    <!-- Second news entry -->
    <item>
      <title>Space Exploration</title>
      <link>http://liftoff.msfc.nasa.gov/</link>
      <description>Sky watchers in Europe, Asia, and parts of Alaska and Canada
      will experience a partial eclipse of the Sun on Saturday, May 31st.</description>
      <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
      <guid>http://liftoff.msfc.nasa.gov/2003/05/30.html#item572</guid>
    </item>
  </channel>
</rss>
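
To make the channel/item structure above concrete, here is a minimal sketch that lists the title and link of every item using only the DOM parser bundled with the JDK. It is not the rsslib4j approach used later in this post; the class name Rss2ItemLister and its helper methods are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of rsslib4j or Rome.
class Rss2ItemLister {

    /** Text of the first descendant element with the given tag, or "" if there is none. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Returns one "title | link" string per item element in an RSS 2.0 document. */
    static List<String> listItems(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(childText(item, "title") + " | " + childText(item, "link"));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed RSS document", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<rss version=\"2.0\"><channel><title>Lift Off News</title>"
                + "<item><title>Star City</title>"
                + "<link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link></item>"
                + "</channel></rss>";
        for (String line : listItems(sample)) {
            System.out.println(line);
        }
    }
}
```

Only the item nodes are visited, so the channel's own title and link are not mixed into the result.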

The Atom 1.0 format:

<?xml version="1.0" encoding="utf-8"?>
<!-- Atom feed root; this namespace identifies Atom 1.0 -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <!-- Channel title -->
  <title>Schema Web</title>
  <!-- Main URL of the channel -->
  <link rel="alternate" type="text/html" href="http://stanzaweb.art/" />
  <!-- Time of last modification -->
  <updated>2004-06-01T10:11:12Z</updated>
  <!-- Channel author -->
  <author>
    <!-- Nickname -->
    <name>Uche Ogbuji</name>
  </author>
  <!-- The list of news items or articles follows -->
  <entry>
    <!-- News title -->
    <title>Welcome to Stanza Web</title>
    <!-- News author -->
    <author>
      <!-- Author nickname -->
      <name>龍軒</name>
      <!-- Home page -->
      <uri>http://www.cnblogs.com/longxuan/</uri>
    </author>
    <!-- Article link -->
    <link rel="alternate" type="text/html" href="http://stanzaweb.art/2004-06-01/welcome" />
    <!-- Time of last modification -->
    <updated>2004-06-01T10:11:12Z</updated>
    <!-- Article content -->
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>Welcome to
          <a href="http://stanzaweb.art/">Stanza Web</a>.
          Come back often to keep track of the best in modern poetry.
        </p>
        <p>This site is powered by
          <a href="http://atomenabled.org">Atom</a>
        </p>
      </div>
    </content>
  </entry>
</feed>
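
One practical difference from RSS shows up as soon as you try to read this: an Atom link is carried in the href attribute of the link element rather than in its text, and every element lives in the Atom namespace. The sketch below illustrates that with the JDK's DOM parser; the class name AtomEntryLister is hypothetical, and the code is an illustration rather than the way any particular library does it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of rsslib4j or Rome.
class AtomEntryLister {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Returns the href of the first link whose rel is absent or "alternate", or "". */
    static String alternateLink(Element entry) {
        NodeList links = entry.getElementsByTagNameNS(ATOM_NS, "link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            String rel = link.getAttribute("rel"); // "" when the attribute is missing
            if (rel.isEmpty() || "alternate".equals(rel)) {
                return link.getAttribute("href");
            }
        }
        return "";
    }

    /** Returns one "title | link" string per entry element in an Atom 1.0 document. */
    static List<String> listEntries(String atomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(atomXml)));
            NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList titles = entry.getElementsByTagNameNS(ATOM_NS, "title");
                String title = titles.getLength() == 0
                        ? "" : titles.item(0).getTextContent().trim();
                result.add(title + " | " + alternateLink(entry));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed Atom document", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<feed xmlns=\"" + ATOM_NS + "\"><title>Schema Web</title>"
                + "<entry><title>Welcome to Stanza Web</title>"
                + "<link rel=\"alternate\" href=\"http://stanzaweb.art/2004-06-01/welcome\"/>"
                + "</entry></feed>";
        for (String line : listEntries(sample)) {
            System.out.println(line);
        }
    }
}
```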

Most news and blog sites use RSS, though Atom holds part of the market too. For example, cnblogs (博客園) uses Atom, while CSDN uses RSS.
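
Because different sites publish different formats, a reader may want to check which one it has received before choosing a parser. A rough way to do that is to look at the root element: rss for RSS 0.91, 0.92 and 2.0, rdf:RDF for RSS 0.90 and 1.0, and feed for Atom. The class below is a minimal sketch of that idea; FeedFormatSniffer and its return values are names invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of rsslib4j or Rome.
class FeedFormatSniffer {

    /** Returns "rss", "rdf", "atom" or "unknown" depending on the document's root element. */
    static String detect(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that rdf:RDF reports the local name "RDF"
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String name = root.getLocalName();
            if ("rss".equals(name)) {
                return "rss";   // RSS 0.91, 0.92, 2.0
            }
            if ("RDF".equals(name)) {
                return "rdf";   // RSS 0.90 and 1.0 are RDF-based
            }
            if ("feed".equals(name)) {
                return "atom";
            }
            return "unknown";
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed XML document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(detect("<feed xmlns=\"http://www.w3.org/2005/Atom\"/>"));
    }
}
```

The check deliberately ignores the version attribute and the namespace, so it only says which family of parser to try, not whether the feed is valid.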

With that background, we can start parsing RSS.

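I looked online for open-source packages and tried out two commonly used ones: Rome.jar and rsslib4j.jar. I won't go into the differences between them here; look them up if you are interested. rsslib4j is small and has good compatibility, but for now it only supports parsing RSS 0.9x, 1.0, and 2.0 and cannot handle Atom. The rsslib4j project page is http://sourceforge.net/projects/rsslib4j/, where you can download whatever you need.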

Add rsslib4j-0.2.jar to WebRoot/lib, then create a new class, Rsslib4jReadRss, in the com.tgb.rssreader.manager package under src. Here is the code:

package com.tgb.rssreader.manager;  
      
import java.net.URL;  
import java.net.URLConnection;  
import java.util.List;  
      
import org.gnu.stealthp.rsslib.RSSChannel;  
import org.gnu.stealthp.rsslib.RSSHandler;  
import org.gnu.stealthp.rsslib.RSSImage;  
import org.gnu.stealthp.rsslib.RSSItem;  
import org.gnu.stealthp.rsslib.RSSParser;  
      
public class Rsslib4jReadRss {  
      
    // Define the address of an online RSS feed here (it points to my NetEase blog)
    //

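Because there are so many articles, the later output may scroll out of view when testing in the Console, so I had the program read only one article summary (the for loop count was changed to 1). The result is shown below: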

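rsslib4j just about copes with the NetEase blog feed, but parsing a CSDN blog fails with the error "Server returned HTTP response code: 403 for URL: http://xxxxxx". This happens because the CSDN blog server refuses requests from Java acting as the client. In addition, some individual fields come back as null when parsed.
So what can we do? Don't worry: in the next post we will modify rsslib4j together and build our own version. Stay tuned!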
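
In the meantime, one common workaround for a 403 of this kind is to send a browser-like User-Agent header, on the assumption that the server is rejecting Java's default one. That is only a guess about CSDN's filtering, and it is not necessarily the fix the next post applies to rsslib4j. The sketch below uses only java.net; the class name BrowserLikeConnection and the header value are made up for illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, not part of rsslib4j.
class BrowserLikeConnection {

    // Illustrative value only; any mainstream browser string could be used instead.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; RssReaderDemo/1.0)";

    /** Prepares a connection that announces a browser-like User-Agent. Nothing is sent yet. */
    static URLConnection open(String feedUrl) {
        try {
            URLConnection conn = new URL(feedUrl).openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);    // milliseconds
            return conn;
        } catch (IOException e) {
            throw new IllegalArgumentException("Cannot open " + feedUrl, e);
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder, not a real feed address.
        URLConnection conn = open("http://example.com/rss");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The request itself goes out only when the caller asks for conn.getInputStream(); the resulting stream can then be passed to whichever XML parser the reader uses.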