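Character set basics:
Character set
A set of characters, that is, symbols with particular semantic meaning. The letter "A" is a character. So is "%". Neither has any intrinsic numeric value, nor any direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers did.
Coded character set
An assignment of numeric values to a set of characters. Assigning codes to characters lets them be represented as numbers under a specific coded character set. Other coded character sets may assign different numeric values to the same character. These mappings are usually defined by standards bodies; examples are US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
A mapping of the members of a coded character set to octets (8-bit bytes). The encoding scheme defines how a sequence of character codes is expressed as a sequence of bytes. The character code values need not be the same as the encoded bytes, and the relationship need not be one-to-one or one-to-many. In principle, encoding and decoding a character set can be thought of as analogous to serializing and deserializing objects.
Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set, it is a mapping; but because the two are so closely tied, most encodings are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Still, one encoding scheme can handle more than one character set: EUC, for example, can encode characters for several Asian languages.
Figure 6-1 is a graphical representation of a Unicode character sequence being encoded into a byte sequence with the UTF-8 encoding scheme. UTF-8 encodes character code values below 0x80 as single-byte values (standard ASCII). All other Unicode characters are encoded as multibyte sequences of 2 to 6 bytes (http://www.ietf.org/rfc/rfc2279.txt). The later RFC 3629 restricts UTF-8 to at most 4 bytes per character.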
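The byte counts described above can be checked directly from Java. The following is a minimal sketch rather than part of the original example set: the class name Utf8Lengths and the helper utf8Length are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // U+0041, below 0x80: 1 byte
        System.out.println(utf8Length("\u00f1"));  // U+00F1: 2 bytes
        System.out.println(utf8Length("\u20ac"));  // U+20AC: 3 bytes
    }
}
```

Only code values below 0x80 stay at one byte; everything else grows to a multibyte sequence.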
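Charset
The term charset is defined in RFC 2278 (http://ietf.org/rfc/rfc2278.txt). It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.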
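Unicode began as a 16-bit character encoding, and a Java char is still a 16-bit UTF-16 code unit; the current standard also defines characters beyond 16 bits, which UTF-16 represents as surrogate pairs. Unicode tries to unify the character sets of all the world's languages into a single, comprehensive mapping. It has earned its place, but many other character encodings remain in wide use.
Most operating systems are still byte-oriented when it comes to I/O and file storage, so whatever encoding is used, Unicode or otherwise, there is still a need to convert between byte sequences and character set encodings.
The classes in the java.nio.charset package meet this need. This is not the first time the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a service provider interface (SPI) so that encoders and decoders can be plugged in as needed.
Charsets: the default is determined at JVM startup and depends on the underlying operating system environment, the locale, and/or the JVM configuration. If you need a specific charset, the safest approach is to name it explicitly. Do not assume the default in deployment is the same as in your development environment. Charset names are case-insensitive: uppercase and lowercase letters are treated as equal when names are compared. The Internet Assigned Numbers Authority (IANA) maintains the registry of all officially registered charset names.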
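To illustrate the advice about naming charsets explicitly, here is a small sketch. The class name CharsetLookup and the helper canonicalName are hypothetical; Charset.defaultCharset() assumes Java 5 or later.

```java
import java.nio.charset.Charset;

class CharsetLookup {
    // Look up a charset by any accepted name and return its canonical name
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        // The default depends on the platform, so no fixed value is expected here
        System.out.println("Default: " + Charset.defaultCharset().name());
        // Lookups ignore case
        System.out.println(canonicalName("utf-8"));
        System.out.println(canonicalName("Us-Ascii"));
    }
}
```

Passing the resulting Charset (or its name) to every API that converts between bytes and characters avoids depending on the platform default.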
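Example 6-1 shows how characters are translated into byte sequences by different Charset implementations.
Example 6-1. Encoding with the standard charsets
The code is as follows: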
package com.ronsoft.books.nio.charset;
import java.nio.charset.Charset;
import java.nio.ByteBuffer;
/**
* Charset encoding test. Run the same input string, which contains some
* non-ascii characters, through several Charset encoders and dump out the hex
* values of the resulting byte sequences.
*
* @author Ron Hitchens
*/
public class EncodeTest {
public static void main(String[] argv) throws Exception {
// This is the character sequence to encode
String input = " \u00bfMa\u00f1ana?";
// the list of charsets to encode with
String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8",
"UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13"
};
for (int i = 0; i < charsetNames.length; i++) {
doEncode(Charset.forName(charsetNames[i]), input);
}
}
/**
* For a given Charset and input string, encode the chars and print out the
* resulting byte encoding in a readable form.
*/
private static void doEncode(Charset cs, String input) {
ByteBuffer bb = cs.encode(input);
System.out.println("Charset: " + cs.name());
System.out.println(" Input: " + input);
System.out.println("Encoded: ");
for (int i = 0; bb.hasRemaining(); i++) {
int b = bb.get();
int ival = ((int) b) & 0xff;
char c = (char) ival;
// Keep tabular alignment pretty
if (i < 10)
System.out.print(" ");
// Print index number
System.out.print(" " + i + ": ");
// Better formatted output is coming someday...
if (ival < 16)
System.out.print("0");
// Print the hex value of the byte
System.out.print(Integer.toHexString(ival));
// If the byte seems to be the value of a
// printable character, print it. No guarantee
// it will be.
if (Character.isWhitespace(c) || Character.isISOControl(c)) {
System.out.println("");
} else {
System.out.println(" (" + c + ")");
}
}
System.out.println("");
}
}
Output:
Charset: US-ASCII
Input: ¿Mañana?
Encoded:
0: 20
1: 3f (?)
2: 4d (M)
3: 61 (a)
4: 3f (?)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)
Charset: ISO-8859-1
Input: ¿Mañana?
Encoded:
0: 20
1: bf (¿)
2: 4d (M)
3: 61 (a)
4: f1 (ñ)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)
Charset: UTF-8
Input: ¿Mañana?
Encoded:
0: 20
1: c2 (Â)
2: bf (¿)
3: 4d (M)
4: 61 (a)
5: c3 (Ã)
6: b1 (±)
7: 61 (a)
8: 6e (n)
9: 61 (a)
10: 3f (?)
Charset: UTF-16BE
Input: ¿Mañana?
Encoded:
0: 00
1: 20
2: 00
3: bf (¿)
4: 00
5: 4d (M)
6: 00
7: 61 (a)
8: 00
9: f1 (ñ)
10: 00
11: 61 (a)
12: 00
13: 6e (n)
14: 00
15: 61 (a)
16: 00
17: 3f (?)
Charset: UTF-16LE
Input: ¿Mañana?
Encoded:
0: 20
1: 00
2: bf (¿)
3: 00
4: 4d (M)
5: 00
6: 61 (a)
7: 00
8: f1 (ñ)
9: 00
10: 61 (a)
11: 00
12: 6e (n)
13: 00
14: 61 (a)
15: 00
16: 3f (?)
17: 00
Charset: UTF-16
Input: ¿Mañana?
Encoded:
0: fe (þ)
1: ff (ÿ)
2: 00
3: 20
4: 00
5: bf (¿)
6: 00
7: 4d (M)
8: 00
9: 61 (a)
10: 00
11: f1 (ñ)
12: 00
13: 61 (a)
14: 00
15: 6e (n)
16: 00
17: 61 (a)
18: 00
19: 3f (?)
The Charset class:
The code is as follows:
package java.nio.charset;
public abstract class Charset implements Comparable
{
public static boolean isSupported (String charsetName)
public static Charset forName (String charsetName)
public static SortedMap availableCharsets()
public final String name()
public final Set aliases()
public String displayName()
public String displayName (Locale locale)
public final boolean isRegistered()
public boolean canEncode()
public abstract CharsetEncoder newEncoder();
public final ByteBuffer encode (CharBuffer cb)
public final ByteBuffer encode (String str)
public abstract CharsetDecoder newDecoder();
public final CharBuffer decode (ByteBuffer bb)
public abstract boolean contains (Charset cs);
public final boolean equals (Object ob)
public final int compareTo (Object ob)
public final int hashCode()
public final String toString()
}
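A Charset object must satisfy several conditions:
The canonical name of a charset should match the name registered with IANA.
If IANA has registered several names for the same charset, the canonical name returned by the object should match the MIME-preferred name in the IANA registry.
If a charset name is removed from the registry, the current canonical name should be retained as an alias.
If a charset is not registered with IANA, its canonical name must begin with "X-" or "x-".
For the most part, only JVM vendors need to worry about these rules. However, if you plan to supply your own charset as part of an application, it helps to know what not to do. Name your charset with a leading "X-"; isRegistered(), which is final in Charset, then returns false for it.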
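The naming rules above can be observed on the charsets that ship with the JVM. This is an illustrative sketch only: the class CharsetNames and its helpers are hypothetical names, and the exact alias list printed depends on the JVM in use.

```java
import java.nio.charset.Charset;
import java.util.Set;

class CharsetNames {
    // Aliases the JVM knows for the named charset
    static Set<String> aliasesOf(String name) {
        return Charset.forName(name).aliases();
    }

    // Whether the JVM reports the named charset as IANA-registered
    static boolean isRegistered(String name) {
        return Charset.forName(name).isRegistered();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        System.out.println("Canonical:  " + latin1.name());
        System.out.println("Aliases:    " + aliasesOf("ISO-8859-1"));
        System.out.println("Registered: " + isRegistered("ISO-8859-1"));
    }
}
```

Every alias resolves back to the same canonical name through Charset.forName().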
Comparing charsets:
The code is as follows:
public abstract class Charset implements Comparable
{
// This is a partial API listing
public abstract boolean contains (Charset cs);
public final boolean equals (Object ob)
public final int compareTo (Object ob)
public final int hashCode()
public final String toString()
}
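Recall that a charset is the combination of a coded character set and an encoding scheme for that set. As with ordinary sets, one charset may be a subset of another. One charset (C1) contains another (C2) if every character that can be represented in C2 can be represented identically in C1. Every charset is considered to contain itself. If this containment relationship holds, then any stream you can encode in C2 (the contained subset) is guaranteed to be encodable in C1 as well, without any substitutions.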
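The containment relationship can be tried out with two standard charsets. A minimal sketch; the class CharsetContains and its helper are made-up names for illustration.

```java
import java.nio.charset.Charset;

class CharsetContains {
    // True if every character of 'inner' is representable in 'outer'
    static boolean contains(String outer, String inner) {
        return Charset.forName(outer).contains(Charset.forName(inner));
    }

    public static void main(String[] args) {
        System.out.println(contains("UTF-8", "US-ASCII"));  // true
        System.out.println(contains("US-ASCII", "UTF-8"));  // false
        System.out.println(contains("UTF-8", "UTF-8"));     // true: a charset contains itself
    }
}
```

Note that contains() is allowed to answer false conservatively, which is what the Rot13 example later in this text does.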
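Charset encoders: a charset consists of a coded character set and an associated encoding scheme. The CharsetEncoder and CharsetDecoder classes implement the conversion scheme.
The code is as follows: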
float averageBytesPerChar()
Returns the average number of bytes that will be produced for each character of input.
boolean canEncode(char c)
Tells whether or not this encoder can encode the given character.
boolean canEncode(CharSequence cs)
Tells whether or not this encoder can encode the given character sequence.
Charset charset()
Returns the charset that created this encoder.
ByteBuffer encode(CharBuffer in)
Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.
CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult encodeLoop(CharBuffer in, ByteBuffer out)
Encodes one or more characters into one or more bytes.
CoderResult flush(ByteBuffer out)
Flushes this encoder.
protected CoderResult implFlush(ByteBuffer out)
Flushes this encoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this encoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this encoder's unmappable-character action.
protected void implReplaceWith(byte[] newReplacement)
Reports a change to this encoder's replacement value.
protected void implReset()
Resets this encoder, clearing any charset-specific internal state.
boolean isLegalReplacement(byte[] repl)
Tells whether or not the given byte array is a legal replacement value for this encoder.
CodingErrorAction malformedInputAction()
Returns this encoder's current action for malformed-input errors.
float maxBytesPerChar()
Returns the maximum number of bytes that will be produced for each character of input.
CharsetEncoder onMalformedInput(CodingErrorAction newAction)
Changes this encoder's action for malformed-input errors.
CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this encoder's action for unmappable-character errors.
byte[] replacement()
Returns this encoder's replacement value.
CharsetEncoder replaceWith(byte[] newReplacement)
Changes this encoder's replacement value.
CharsetEncoder reset()
Resets this encoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this encoder's current action for unmappable-character errors.
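A CharsetEncoder object is a stateful transformation engine: characters go in, bytes come out. Several calls to the encoder may be needed to complete a conversion, and the encoder keeps the state of the conversion between calls.
A few notes about the CharsetEncoder API. First, the simpler form of encode() is a convenience method that encodes all of the CharBuffer you provide into a newly allocated ByteBuffer in one step. This is the method that is ultimately invoked when you call encode() directly on the Charset class.
The three-argument form of encode() returns a CoderResult object describing one of four conditions:
Underflow
The input is exhausted (or incomplete) and the encoder needs more characters.
Overflow
The output ByteBuffer is full and must be drained.
Malformed input
The input contains a sequence that is not legal, for example an unpaired surrogate.
Unmappable character
The input character is valid but has no representation in this charset.
While encoding, if the encoder encounters malformed or unmappable input, a result object is returned. You can also test individual characters, or sequences of characters, to determine whether they can be encoded. These are the methods for checking encodability:
The code is as follows: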
package java.nio.charset;
public abstract class CharsetEncoder
{
// This is a partial API listing
public boolean canEncode (char c)
public boolean canEncode (CharSequence cs)
}
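As a usage sketch for the two canEncode() methods, the following checks the sample string from Example 6-1 against two charsets. The class CanEncodeDemo and its helper are hypothetical; a fresh encoder is created per call because canEncode() may not be invoked while an encoding operation is in progress.

```java
import java.nio.charset.Charset;

class CanEncodeDemo {
    // Ask a fresh encoder of the named charset whether it can encode the text
    static boolean canEncode(String charsetName, CharSequence text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u00bfMa\u00f1ana?";
        System.out.println(canEncode("US-ASCII", text));    // false: two chars are not ASCII
        System.out.println(canEncode("ISO-8859-1", text));  // true
        System.out.println(canEncode("US-ASCII", "Manana?")); // true
    }
}
```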
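CodingErrorAction defines three public fields:
REPORT
The default behavior when a CharsetEncoder is created. This action indicates that coding errors should be reported by returning a CoderResult object, as described earlier.
IGNORE
Indicates that coding errors should be ignored and any erroneous input dropped as if it were not there.
REPLACE
Handles coding errors by dropping the erroneous input and emitting the replacement byte sequence currently defined for this CharsetEncoder.
Remember that charset encoding turns characters into byte sequences that will later be decoded. If the replacement sequence cannot be decoded into a valid character sequence, the encoded byte sequence becomes invalid.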
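The three actions can be compared side by side. The sketch below is an illustration under stated assumptions: ErrorActions and encodeAscii are made-up names, and the convenience form of encode() is used, which turns a reported error into a CharacterCodingException instead of returning a CoderResult.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorActions {
    // Encode text as US-ASCII with the given action, then decode the bytes
    // back so the effect of the action is visible as a string.
    static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Reached only with REPORT
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String text = "Ma\u00f1ana";
        System.out.println(encodeAscii(text, CodingErrorAction.REPLACE)); // Ma?ana
        System.out.println(encodeAscii(text, CodingErrorAction.IGNORE));  // Maana
        System.out.println(encodeAscii(text, CodingErrorAction.REPORT));  // error: UnmappableCharacterException
    }
}
```

Charset.encode() behaves like the REPLACE case, which is why the US-ASCII output of Example 6-1 shows 3f for the two non-ASCII characters.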
The CoderResult class: CoderResult objects are returned by CharsetEncoder and CharsetDecoder objects:
The code is as follows:
package java.nio.charset;
public class CoderResult {
public static final CoderResult OVERFLOW
public static final CoderResult UNDERFLOW
public boolean isUnderflow()
public boolean isOverflow()
public boolean isError()
public boolean isMalformed()
public boolean isUnmappable()
public int length()
public static CoderResult malformedForLength (int length)
public static CoderResult unmappableForLength (int length)
public void throwException() throws CharacterCodingException
}
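A CoderResult is typically consumed in a loop that drains the output buffer on overflow. The following sketch encodes through a deliberately small buffer; ChunkedEncode, encodeInChunks, and drain are hypothetical names, and the 4-byte minimum is an assumption chosen so that any single UTF-8 character fits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedEncode {
    // Encode text as UTF-8 through a small output buffer, draining the
    // buffer each time the encoder hands control back.
    static byte[] encodeInChunks(String text, int outCapacity) {
        if (outCapacity < 4) {
            // One code point can need 4 bytes; a smaller buffer could never drain
            throw new IllegalArgumentException("output buffer too small");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(outCapacity);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        CoderResult result;
        do {
            result = encoder.encode(in, out, true);
            if (result.isError()) {
                throw new IllegalArgumentException("cannot encode: " + result);
            }
            drain(out, sink);
        } while (result.isOverflow()); // OVERFLOW: more input is still pending

        // Flush any state held by the encoder, again watching for overflow
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    // Copy the bytes written so far to the sink and empty the buffer
    static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), out.arrayOffset() + out.position(), out.remaining());
        out.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInChunks("\u00bfMa\u00f1ana?", 4);
        System.out.println(bytes.length + " bytes"); // 10 bytes
    }
}
```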
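Charset decoders: a charset decoder is the inverse of an encoder. It converts bytes encoded with a particular encoding scheme into a sequence of 16-bit Unicode characters. Like CharsetEncoder, CharsetDecoder is a stateful transformation engine. Neither is thread-safe, because calling their methods changes their state, and that state is retained between calls.
The code is as follows: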
float averageCharsPerByte()
Returns the average number of characters that will be produced for each byte of input.
Charset charset()
Returns the charset that created this decoder.
CharBuffer decode(ByteBuffer in)
Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult decodeLoop(ByteBuffer in, CharBuffer out)
Decodes one or more bytes into one or more characters.
Charset detectedCharset()
Retrieves the charset that was detected by this decoder (optional operation).
CoderResult flush(CharBuffer out)
Flushes this decoder.
protected CoderResult implFlush(CharBuffer out)
Flushes this decoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this decoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this decoder's unmappable-character action.
protected void implReplaceWith(String newReplacement)
Reports a change to this decoder's replacement value.
protected void implReset()
Resets this decoder, clearing any charset-specific internal state.
boolean isAutoDetecting()
Tells whether or not this decoder implements an auto-detecting charset.
boolean isCharsetDetected()
Tells whether or not this decoder has yet detected a charset (optional operation).
CodingErrorAction malformedInputAction()
Returns this decoder's current action for malformed-input errors.
float maxCharsPerByte()
Returns the maximum number of characters that will be produced for each byte of input.
CharsetDecoder onMalformedInput(CodingErrorAction newAction)
Changes this decoder's action for malformed-input errors.
CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this decoder's action for unmappable-character errors.
String replacement()
Returns this decoder's replacement value.
CharsetDecoder replaceWith(String newReplacement)
Changes this decoder's replacement value.
CharsetDecoder reset()
Resets this decoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this decoder's current action for unmappable-character errors.
The methods that actually perform the decoding:
The code is as follows:
package java.nio.charset;
public abstract class CharsetDecoder
{
// This is a partial API listing
public final CharsetDecoder reset()
public final CharBuffer decode (ByteBuffer in)
throws CharacterCodingException
public final CoderResult decode (ByteBuffer in, CharBuffer out,
boolean endOfInput)
public final CoderResult flush (CharBuffer out)
}
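Decoding is similar to encoding and involves the same basic steps:
1. Reset the decoder by calling reset(), which puts it in a known state, ready to accept input.
2. Call decode() zero or more times with endOfInput set to false to feed bytes to the decoding engine. As decoding proceeds, characters are added to the given CharBuffer.
3. Call decode() once with endOfInput set to true to tell the decoder that all input has been provided.
4. Call flush() to make sure all decoded characters have been sent to the output.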
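The four steps can be sketched on input that arrives in two pieces, with the split allowed to fall inside a multibyte sequence. ChunkedDecode, decodeInTwoChunks, and check are made-up names, and the output buffer is sized on the assumption that UTF-8 never yields more chars than bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecode {
    // Decode UTF-8 bytes delivered as data[0..split) followed by data[split..)
    static String decodeInTwoChunks(byte[] data, int split) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(data.length);
        CharBuffer out = CharBuffer.allocate(data.length);

        decoder.reset();                          // step 1: known state
        in.put(data, 0, split);
        in.flip();
        check(decoder.decode(in, out, false));    // step 2: more input to come
        in.compact();                             // keep bytes of an unfinished char
        in.put(data, split, data.length - split);
        in.flip();
        check(decoder.decode(in, out, true));     // step 3: this is the last input
        check(decoder.flush(out));                // step 4: push out remaining state
        out.flip();
        return out.toString();
    }

    // This sketch expects UNDERFLOW only; anything else is reported
    static void check(CoderResult result) {
        if (result.isError() || result.isOverflow()) {
            throw new IllegalStateException("unexpected result: " + result);
        }
    }

    public static void main(String[] args) {
        byte[] data = "\u00bfMa\u00f1ana?".getBytes(StandardCharsets.UTF_8);
        // Index 5 falls between the two bytes that encode U+00F1
        System.out.println(decodeInTwoChunks(data, 5).equals("\u00bfMa\u00f1ana?")); // true
    }
}
```

The compact() call matters: after an UNDERFLOW the input buffer may still hold the first bytes of an incomplete character, and clearing it would lose them.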
Example 6-2 shows how to decode a stream of bytes representing a character set encoding.
Example 6-2. Charset decoding
The code is as follows:
package com.ronsoft.books.nio.charset;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
import java.io.*;
/**
* Test charset decoding.
*
* @author Ron Hitchens
*/
public class CharsetDecode {
/**
* Test charset decoding in the general case, detecting and handling buffer
* under/overflow and flushing the decoder state at end of input. This code
* reads from stdin and decodes the ASCII-encoded byte stream to chars. The
* decoded chars are written to stdout. This is effectively a 'cat' for
* input ascii files, but another charset encoding could be used by simply
* specifying it on the command line.
*/
public static void main(String[] argv) throws IOException {
// Default charset is ISO-8859-1 (Latin-1)
String charsetName = "ISO-8859-1";
// Charset name can be specified on the command line
if (argv.length > 0) {
charsetName = argv[0];
}
// Wrap a Channel around stdin, wrap a channel around stdout,
// find the named Charset and pass them to the decode method.
// If the named charset is not valid, an exception of type
// UnsupportedCharsetException will be thrown.
decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter(
System.out), Charset.forName(charsetName));
}
/**
* General purpose static method which reads bytes from a Channel, decodes
* them according to the given Charset, and writes the resulting chars to
* a Writer.
*
* @param source
* A ReadableByteChannel object which will be read to EOF as a
* source of encoded bytes.
* @param writer
* A Writer object to which decoded chars will be written.
* @param charset
* A Charset object, whose CharsetDecoder will be used to do the
* character set decoding.
*/
public static void decodeChannel(ReadableByteChannel source, Writer writer,
Charset charset) throws UnsupportedCharsetException, IOException {
// Get a decoder instance from the Charset
CharsetDecoder decoder = charset.newDecoder();
// Tell decoder to replace bad chars with default mark
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
// Allocate radically different input and output buffer sizes
// for testing purposes
ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024);
CharBuffer cb = CharBuffer.allocate(57);
// Buffer starts empty; indicate input is needed
CoderResult result = CoderResult.UNDERFLOW;
boolean eof = false;
while (true) {
// Input buffer underflow; decoder wants more input
if (result == CoderResult.UNDERFLOW) {
// Stop once the decoder has consumed everything up to EOF
if (eof) {
break;
}
// Keep any undecoded bytes (for example a partial multibyte
// sequence) and prepare to refill behind them
bb.compact();
// Fill the input buffer; watch for EOF
eof = (source.read(bb) == -1);
// Prepare the buffer for reading by decoder
bb.flip();
}
// Decode input bytes to output chars; pass EOF flag
result = decoder.decode(bb, cb, eof);
// If output buffer is full, drain output
if (result == CoderResult.OVERFLOW) {
drainCharBuf(cb, writer);
}
}
// Flush any remaining state from the decoder, being careful
// to detect output buffer overflow(s)
while (decoder.flush(cb) == CoderResult.OVERFLOW) {
drainCharBuf(cb, writer);
}
// Drain any chars remaining in the output buffer
drainCharBuf(cb, writer);
// Close the channel; push out any buffered data to stdout
source.close();
writer.flush();
}
/**
* Helper method to drain the char buffer and write its content to the given
* Writer object. Upon return, the buffer is empty and ready to be refilled.
*
* @param cb
* A CharBuffer containing chars to be written.
* @param writer
* A Writer object to consume the chars in cb.
*/
static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException {
cb.flip(); // Prepare buffer for draining
// This writes the chars contained in the CharBuffer but
// doesn't actually modify the state of the buffer.
// If the char buffer was being drained by calls to get( ),
// a loop might be needed here.
if (cb.hasRemaining()) {
writer.write(cb.toString());
}
cb.clear(); // Prepare buffer to be filled again
}
}
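The charset service provider interface: the pluggable SPI architecture is used throughout the Java environment in many different contexts. The 1.4 JDK has eight packages named spi, and several others that go by other names. Pluggability is a powerful design technique, one of the cornerstones on which Java's portability and adaptability are built.
Before looking at the API, a little explanation of how the Charset SPI works is in order. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class supply information about the Charset objects they provide. To define a custom charset, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider, which makes those classes available to the JVM.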
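Creating a custom charset:
At a minimum, you must create a subclass of java.nio.charset.Charset, provide concrete implementations of its three abstract methods (newEncoder(), newDecoder(), and contains()), and supply a constructor. The Charset class has no default, no-argument constructor, so your custom charset class must have a constructor, even if it takes no arguments. This is because you must invoke Charset's constructor at instantiation time (by calling super() at the start of your constructor) to supply your charset's canonical name and aliases. Doing so lets the methods in the Charset class handle the name-related work for you, so it is a good thing.
Likewise, you need to provide concrete implementations of CharsetEncoder and CharsetDecoder. Recall that a charset is the combination of a coded character set and an encode/decode scheme. As we have seen, encoding and decoding are nearly symmetrical at the API level. What follows is a brief discussion of what is needed to implement an encoder; the same applies to building a decoder.
Like Charset, CharsetEncoder has no default constructor, so you need to call super() in your concrete class constructor and supply the required parameters.
To supply your own CharsetEncoder implementation, you must at least provide a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes arguments similar to those of encode(), without the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which only needs to consume characters from the CharBuffer argument and write the encoded bytes to the supplied ByteBuffer.
Now that we have seen how to implement a custom charset, including its encoder and decoder, let's see how to hook it into the JVM so that running code can use it.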
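Providing your custom charset:
To make your own Charset implementation available to the JVM runtime environment, you must create a concrete subclass of the CharsetProvider class from java.nio.charset.spi, with a no-argument constructor. The no-argument constructor matters because your CharsetProvider class is located by reading its fully qualified name from a configuration file. That class name string is then passed to Class.newInstance() to instantiate your provider, which works only through a no-argument constructor.
The configuration file the JVM reads to locate charset providers is named java.nio.charset.spi.CharsetProvider. It lives in a resource directory (META-INF/services) on the JVM classpath. Every Java Archive (JAR) file has a META-INF directory that can hold information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.
The CharsetProvider API is almost trivial. The real work of providing a custom charset lies in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. CharsetProvider is merely a facilitator that connects your charset to the runtime environment.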
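One way to check whether a provider has been picked up is to list the CharsetProvider implementations visible through the service mechanism and then ask for the charset by name. This is a diagnostic sketch under stated assumptions: ProviderCheck and providerNames are hypothetical names, java.util.ServiceLoader assumes Java 6 or later, and the list it prints depends entirely on the JVM and classpath in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ProviderCheck {
    // Class names of the CharsetProvider implementations found via
    // META-INF/services entries (or module declarations)
    static List<String> providerNames() {
        List<String> names = new ArrayList<>();
        for (CharsetProvider provider : ServiceLoader.load(CharsetProvider.class)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Providers: " + providerNames());
        // False unless a provider for this name is installed on the classpath
        System.out.println("X-ROT13 supported: " + Charset.isSupported("X-ROT13"));
    }
}
```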
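Example 6-3 shows the implementation of a custom Charset and CharsetProvider, with sample code illustrating charset use, encoding and decoding, and the Charset SPI. Example 6-3 implements a custom Charset.
Example 6-3. The custom Rot13 charset
The code is as follows: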
package com.ronsoft.books.nio.charset;
import java.nio.CharBuffer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.util.Map;
import java.util.Iterator;
import java.io.Writer;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.OutputStreamWriter;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;
/**
* A Charset implementation which performs Rot13 encoding. Rot-13 encoding is a
* simple text obfuscation algorithm which shifts alphabetical characters by 13
* so that 'a' becomes 'n', 'o' becomes 'b', etc. This algorithm was popularized
* by the Usenet discussion forums many years ago to mask naughty words, hide
* answers to questions, and so on. The Rot13 algorithm is symmetrical, applying
* it to text that has been scrambled by Rot13 will give you the original
* unscrambled text.
*
* Applying this Charset encoding to an output stream will cause everything you
* write to that stream to be Rot13 scrambled as it's written out. And applying
* it to an input stream causes data read to be Rot13 descrambled as it's read.
*
* @author Ron Hitchens
*/
public class Rot13Charset extends Charset {
// the name of the base charset encoding we delegate to
private static final String BASE_CHARSET_NAME = "UTF-8";
// Handle to the real charset we'll use for transcoding between
// characters and bytes. Doing this allows us to apply the Rot13
// algorithm to multibyte charset encodings. But only the
// ASCII alpha chars will be rotated, regardless of the base encoding.
Charset baseCharset;
/**
* Constructor for the Rot13 charset. Call the superclass constructor to
* pass along the name(s) we'll be known by. Then save a reference to the
* delegate Charset.
*/
protected Rot13Charset(String canonical, String[] aliases) {
super(canonical, aliases);
// Save the base charset we're delegating to
baseCharset = Charset.forName(BASE_CHARSET_NAME);
}
// ----------------------------------------------------------
/**
* Called by users of this Charset to obtain an encoder. This implementation
* instantiates an instance of a private class (defined below) and passes it
* an encoder from the base Charset.
*/
public CharsetEncoder newEncoder() {
return new Rot13Encoder(this, baseCharset.newEncoder());
}
/**
* Called by users of this Charset to obtain a decoder. This implementation
* instantiates an instance of a private class (defined below) and passes it
* a decoder from the base Charset.
*/
public CharsetDecoder newDecoder() {
return new Rot13Decoder(this, baseCharset.newDecoder());
}
/**
* This method must be implemented by concrete Charsets. We always say no,
* which is safe.
*/
public boolean contains(Charset cs) {
return (false);
}
/**
* Common routine to rotate all the ASCII alpha chars in the given
* CharBuffer by 13. Note that this code explicitly compares for upper and
* lower case ASCII chars rather than using the methods
* Character.isLowerCase and Character.isUpperCase. This is because the
* rotate-by-13 scheme only works properly for the alphabetic characters of
* the ASCII charset and those methods can return true for non-ASCII Unicode
* chars.
*/
private void rot13(CharBuffer cb) {
for (int pos = cb.position(); pos < cb.limit(); pos++) {
char c = cb.get(pos);
char a = '\u0000';
// Is it lowercase alpha?
if ((c >= 'a') && (c <= 'z')) {
a = 'a';
}
// Is it uppercase alpha?
if ((c >= 'A') && (c <= 'Z')) {
a = 'A';
}
// If either, roll it by 13
if (a != '\u0000') {
c = (char) ((((c - a) + 13) % 26) + a);
cb.put(pos, c);
}
}
}
// --------------------------------------------------------
/**
* The encoder implementation for the Rot13 Charset. This class, and the
* matching decoder class below, should also override the "impl" methods,
* such as implOnMalformedInput( ) and make passthrough calls to the
* baseEncoder object. That is left as an exercise for the hacker.
*/
private class Rot13Encoder extends CharsetEncoder {
private CharsetEncoder baseEncoder;
/**
* Constructor, call the superclass constructor with the Charset object
* and the encodings sizes from the delegate encoder.
*/
Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) {
super(cs, baseEncoder.averageBytesPerChar(), baseEncoder
.maxBytesPerChar());
this.baseEncoder = baseEncoder;
}
/**
* Implementation of the encoding loop. First, we apply the Rot13
* scrambling algorithm to the CharBuffer, then reset the encoder for
* the base Charset and call its encode( ) method to do the actual
* encoding. This may not work properly for non-Latin charsets. The
* CharBuffer passed in may be read-only or re-used by the caller for
* other purposes so we duplicate it and apply the Rot13 encoding to the
* copy. We DO want to advance the position of the input buffer to
* reflect the chars consumed.
*/
protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) {
CharBuffer tmpcb = CharBuffer.allocate(cb.remaining());
while (cb.hasRemaining()) {
tmpcb.put(cb.get());
}
tmpcb.rewind();
rot13(tmpcb);
baseEncoder.reset();
CoderResult cr = baseEncoder.encode(tmpcb, bb, true);
// If error or output overflow, we need to adjust
// the position of the input buffer to match what
// was really consumed from the temp buffer. If
// underflow (all input consumed), this is a no-op.
cb.position(cb.position() - tmpcb.remaining());
return (cr);
}
}
// --------------------------------------------------------
/**
* The decoder implementation for the Rot13 Charset.
*/
private class Rot13Decoder extends CharsetDecoder {
private CharsetDecoder baseDecoder;
/**
* Constructor, call the superclass constructor with the Charset object
* and pass along the chars/byte values from the delegate decoder.
*/
Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) {
super(cs, baseDecoder.averageCharsPerByte(), baseDecoder
.maxCharsPerByte());
this.baseDecoder = baseDecoder;
}
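/**
* Implementation of the decoding loop. First, we reset the decoder for
* the base charset, then call it to decode the bytes into characters,
* saving the result code. The newly decoded chars are then de-scrambled
* with the Rot13 algorithm and the result code is returned. This may not
* work properly for non-Latin charsets.
*/
protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) {
// Remember where the decoded output starts in the caller's buffer
int start = cb.position();
baseDecoder.reset();
CoderResult result = baseDecoder.decode(bb, cb, true);
// rot13() works on [position, limit), so apply it to a view that
// covers only the chars just decoded: [start, cb.position())
CharBuffer decoded = cb.duplicate();
decoded.limit(cb.position());
decoded.position(start);
rot13(decoded);
return (result);
}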
}
// --------------------------------------------------------
/**
* Unit test for the Rot13 Charset. This main( ) will open and read an input
* file if named on the command line, or stdin if no args are provided, and
* write the contents to stdout via the X-ROT13 charset encoding. The
* "encryption" implemented by the Rot13 algorithm is symmetrical. Feeding
* in a plain-text file, such as Java source code for example, will output a
* scrambled version. Feeding the scrambled version back in will yield the
* original plain-text document.
*/
public static void main(String[] argv) throws Exception {
BufferedReader in;
if (argv.length > 0) {
// Open the named file
in = new BufferedReader(new FileReader(argv[0]));
} else {
// Wrap a BufferedReader around stdin
in = new BufferedReader(new InputStreamReader(System.in));
}
// Create a PrintStream that uses the Rot13 encoding
PrintStream out = new PrintStream(System.out, false, "X-ROT13");
String s = null;
// Read all input and write it to the output.
// As the data passes through the PrintStream,
// it will be Rot13-encoded.
while ((s = in.readLine()) != null) {
out.println(s);
}
out.flush();
}
}
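To use this Charset and its encoder and decoder, it must be made available to the Java runtime environment. This is done with a CharsetProvider class (Example 6-4).
Example 6-4. Custom charset provider
The code is as follows: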
package com.ronsoft.books.nio.charset;
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.HashSet;
import java.util.Iterator;
/**
* A CharsetProvider class which makes available the charsets provided by
* Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is
* not a registered IANA charset, so its name begins with "X-" to avoid name
* clashes with official charsets.
*
* To activate this CharsetProvider, it's necessary to add a file to the
* classpath of the JVM runtime at the following location:
* META-INF/services/java.nio.charset.spi.CharsetProvider
*
* That file must contain a line with the fully qualified name of this class on
* a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider
*
* See the javadoc page for java.nio.charset.spi.CharsetProvider for full
* details.
*
* @author Ron Hitchens
*/
public class RonsoftCharsetProvider extends CharsetProvider {
// the name of the charset we provide
private static final String CHARSET_NAME = "X-ROT13";
// a handle to the Charset object
private Charset rot13 = null;
/**
* Constructor, instantiate a Charset object and save the reference.
*/
public RonsoftCharsetProvider() {
this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]);
}
/**
* Called by Charset static methods to find a particular named Charset. If
* it's the name of this charset (we don't have any aliases) then return the
* Rot13 Charset, else return null.
*/
public Charset charsetForName(String charsetName) {
if (charsetName.equalsIgnoreCase(CHARSET_NAME)) {
return (rot13);
}
return (null);
}
/**
* Return an Iterator over the set of Charset objects we provide.
*
* @return An Iterator object containing references to all the Charset
* objects provided by this class.
*/
public Iterator<Charset> charsets() {
HashSet<Charset> set = new HashSet<Charset>(1);
set.add(rot13);
return (set.iterator());
}
}
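For the JVM runtime environment to see this charset provider, a file named META-INF/services/java.nio.charset.spi.CharsetProvider must exist in one of the JARs or in a directory on the classpath. The content of that file must be:
com.ronsoft.books.nio.charset.RonsoftCharsetProvider
Adding X-ROT13 to the charset list in Example 6-1 produces this additional output: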
Charset: X-ROT13
Input: ¿Mañana?
Encoded:
0: c2 (Â)
1: bf (¿)
2: 5a (Z)
3: 6e (n)
4: c3 (Ã)
5: b1 (±)
6: 6e (n)
7: 61 (a)
8: 6e (n)
9: 3f (?)
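Summary: many Java programmers will never need to deal with character set transcoding issues, and most will never create a custom charset. But for those who do, the classes in java.nio.charset and java.nio.charset.spi provide a powerful and flexible mechanism for character handling.
Charset
Encapsulates a coded character set and the encoding scheme used to represent sequences of characters from that set as byte sequences.
CharsetEncoder
An encoding engine that converts a character sequence into a byte sequence. The byte sequence can later be decoded to reconstruct the original character sequence.
CharsetDecoder
A decoding engine that converts an encoded byte sequence into a character sequence.
CharsetProvider SPI
Locates Charset implementations through the service provider mechanism and makes them available for use in the runtime environment.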