
Analyzing the Binary Scripts of Quartett!


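I really have been slacking on the NetOA side for the past couple of days. The reason is nothing other than the script analysis described in this post. The analysis is not complete, but I think it is already good enough to use, so I may as well write it up.

Last weekend 漢公 suddenly brought up FFDSystem, and then someone contacted me about doing a Chinese localization of Quartett!. Ever since I started working with 漢公 and 明大 on localization projects, I have mostly handled script processing; 漢公 tackles the hard reverse-engineering problems, while 明大 mainly handles repacking and sometimes also builds the script editor, depending on how the work is split. This time was no exception: 漢公 took on cracking the resource files and extracting the resources (repacking has not been done yet), and the scripts were handed to me for the time being. Normally, if the scripts are unprocessed plain text, there is nothing for me to do; this time, sure enough, they turned out to be processed binary scripts.

When I got the script files that had already been extracted from Script.dat, I was startled: the file names were all MD5 hashes... Evidently 漢公 has not finished cracking the resource archive yet. No matter; as long as the file contents are correct, work can begin. What I can confirm is that the scripts (more precisely, the scripts as handed to me) have the extension tkn.

Opening the first file, 0a69b4afebd6d64527a21e3f1aa993f9.tkn, shows the following: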


Offset      0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F
00000000   54 4F 4B 45 4E 53 45 54  64 00 00 00 76 08 00 00   TOKENSETd...v...
00000010   0C 00 00 00 85 23 00 0C  00 00 00 81 62 61 73 65   ....?.....｜ase
00000020   5F 70 61 74 68 00 0C 00  00 00 83 2E 2E 2F 00 16   _path.....?./..
00000030   00 00 00 85 23 00 16 00  00 00 81 69 6E 63 6C 75   ...?.....（nclu
00000040   64 65 00 16 00 00 00 83  53 63 72 69 70 74 2F 42   de.....ゴcript/B
00000050   61 73 65 49 6E 73 74 72  75 63 74 69 6F 6E 2E 74   aseInstruction.t
00000060   78 74 00 20 00 00 00 81  6D 6F 74 69 6F 6E 00 20   xt. ...［otion. 
00000070   00 00 00 81 4D 61 69 6E  00 20 00 00 00 85 28 00   ...｀ain. ...?.

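It may look depressing (?) to read, but I was actually delighted to see so many ASCII characters. You can make out TOKENSET at the very beginning (at this point I could not yet tell what the d was), then ase_path, nclude, and so on. Looking closer, the characters that seem to have been clipped are all still there: the strings are really base_path and include. The editor fails to display them only because any byte greater than 0x7F is interpreted as the lead byte of a two-byte character in a double-byte character set (DBCS). For example, 0x81 "eats" the b (0x62) of base_path.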
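To make the "eaten" byte concrete, here is a minimal sketch that imitates how a DBCS-aware text column might render the bytes around base_path. It is only an illustration: DbcsDisplayDemo, IsLeadByte and Render are hypothetical names, the lead-byte ranges are the usual Shift-JIS (code page 932) ones, and a real editor would show the actual glyph for the pair (the dump above shows ｜ for 81 62) rather than a '?'.

```csharp
using System;
using System.Text;

static class DbcsDisplayDemo
{
    // Shift-JIS (code page 932) lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
    public static bool IsLeadByte(byte b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Renders bytes roughly like the text column of a DBCS-aware hex editor:
    // a lead byte swallows the byte after it, and the pair is shown as one '?'.
    public static string Render(byte[] data)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            if (IsLeadByte(b) && i + 1 < data.Length)
            {
                sb.Append('?'); // placeholder for the two-byte glyph
                i++;            // the next byte is consumed as the trail byte
            }
            else if (b >= 0x20 && b < 0x7F)
            {
                sb.Append((char)b); // printable ASCII
            }
            else
            {
                sb.Append('.');     // control bytes and anything else
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Offsets 0x1B..0x25 of the dump: 81, then "base_path", then 00.
        byte[] sample = { 0x81, 0x62, 0x61, 0x73, 0x65, 0x5F, 0x70, 0x61, 0x74, 0x68, 0x00 };
        Console.WriteLine(Render(sample)); // prints "?ase_path."
    }
}
```

The b is still in the file; it is only hidden because it was paired with the 0x81 in front of it.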

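Within the range of the dump above, I identified these strings: base_path, include, Script/BaseInstruction.txt, motion and Main. Looking at what surrounds them reveals a pattern: each string ends with a 0 byte, so they are standard C strings; each string is preceded by a byte greater than 0x7F (note 0x81 and 0x83); before that byte there always seem to be three 00 bytes, which are in turn preceded by a non-00 byte.

To make the analysis easier, I wrote a small program that extracts the information I was interested in.

This is what it extracted for the bytes shown above:

(Format: string start offset, a mysterious number, the byte just before the string, string content)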


0x1C 0xC 0x81 base_path
0x2B 0xC 0x83 ../
0x3B 0x16 0x81 include
0x48 0x16 0x83 Script/BaseInstruction.txt
0x68 0x20 0x81 motion
0x74 0x20 0x81 Main

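From this, the byte just before each string looks like some kind of opcode or type, while the value before it is some mysterious number that stays the same for several entries in a row and then increases a little.

Next, it suddenly struck me that 0x85 is an important value too. There are strings introduced by it as well, but they are usually one-character symbols, so I had overlooked them earlier. After some thought I decided to scan for everything that starts with a byte from 0x80 to 0x88 and is preceded by three 00 bytes. So I changed the test condition in the previous program and ended up with the following code:

opcode_analysis.cs: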


using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace FFDSystemAnalysis
{
    sealed class Analyzer
    {
        // "TOKENSET" followed by 64 00 00 00; all 12 bytes are assumed to be the signature.
        private static readonly byte[ ] SIGNATURE = {
            ( byte )0x54, ( byte )0x4F, ( byte )0x4B, ( byte )0x45,
            ( byte )0x4E, ( byte )0x53, ( byte )0x45, ( byte )0x54,
            ( byte )0x64, ( byte )0x0, ( byte )0x0, ( byte )0x0
        };

        static void Main( string[ ] args ) {
            if ( args.Length < 1 ) return;
            if ( !args[ 0 ].EndsWith( ".tkn" ) ) return;
            if ( !File.Exists( args[ 0 ] ) ) return;

            string infile = args[ 0 ];
            string outfile = infile + ".txt";

            Encoding utf16le = new UnicodeEncoding( false, true );
            Encoding jis = Encoding.GetEncoding( 932 );

            using ( BinaryReader reader = new BinaryReader( File.OpenRead( infile ), jis ) ) {
                using ( BinaryWriter writer = new BinaryWriter( File.Create( outfile ), utf16le ) ) {
                    byte[ ] sig = reader.ReadBytes( SIGNATURE.Length );
                    if ( !Equals( sig, SIGNATURE ) ) {
                        Console.WriteLine( "Wrong signature" );
                        return;
                    }

                    // write UTF-16LE BOM
                    writer.Write( ( ushort ) 0xFEFF );

                    // the queue holds the three most recently read bytes
                    Queue<byte> queue = new Queue<byte>( 3 );
                    queue.Enqueue( reader.ReadByte( ) );
                    queue.Enqueue( reader.ReadByte( ) );
                    queue.Enqueue( reader.ReadByte( ) );

                    // the byte read just before the three bytes in the queue
                    byte lastOpcode = 0;
                    while ( reader.BaseStream.Position < reader.BaseStream.Length ) {
                        byte currentByte = reader.ReadByte( );
                        if ( currentByte == 0x080
                            || currentByte == 0x081
                            || currentByte == 0x082
                            || currentByte == 0x083
                            || currentByte == 0x084
                            || currentByte == 0x085
                            || currentByte == 0x086
                            || currentByte == 0x087
                            || currentByte == 0x088 ) {
                            if ( MatchQueueData( queue ) ) {
                                // three 00 bytes precede this byte: record the string that follows
                                long position = reader.BaseStream.Position;
                                string line = ReadCString( reader );
                                Entry e = new Entry( ) {
                                    position = position,
                                    opcode = currentByte,
                                    lastOpcode = lastOpcode,
                                    line = line
                                };
                                writer.Write(
                                    utf16le.GetBytes(
                                        string.Format( "{0}{1}",
                                            e.ToString( ),
                                            Environment.NewLine )
                                    )
                                );
                            } // if
                        } // if

                        // slide the three-byte window forward by one byte
                        lastOpcode = queue.Dequeue( );
                        queue.Enqueue( currentByte );
                    } // while
                } // using
            } // using
        } // Main

        // byte-wise array comparison
        static bool Equals( byte[ ] a, byte[ ] b ) {
            int len = a.Length;
            if ( len != b.Length ) return false;
            for ( int i = 0; i < len; i++ ) {
                if ( a[ i ] != b[ i ] ) return false;
            }
            return true;
        }

        // true when the last three bytes read were all 00
        static bool MatchQueueData( Queue<byte> queue ) {
            byte[ ] array = queue.ToArray( );
            return Equals( zeros, array );
        }

        // reads characters (decoded with the reader's encoding) up to the terminating '\0'
        static string ReadCString( BinaryReader reader ) {
            StringBuilder builder = new StringBuilder( );
            char c = '\0';

            while ( ( c = reader.ReadChar( ) ) != '\0' ) {
                builder.Append( c );
            }

            return builder.ToString( );
        }

        static readonly byte[ ] zeros
            = new byte[ ] { 0, 0, 0 };
    }

    struct Entry
    {
        public long position;
        public string line;
        public byte opcode;
        public byte lastOpcode;

        public override string ToString( ) {
            return string.Format( "0x{0:X} 0x{1:X} 0x{2:X} {3}",
                this.position, this.lastOpcode, this.opcode, this.line );
        }
    }
}

There is nothing remarkable about this code, except that the statement creating the Entry (Entry e = new Entry( ) { position = position, ... }) looks a little odd: is it really assigning variables to themselves?

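No, no; I would never make that kind of mistake. This is actually an interesting piece of C# 3.0 syntax, the object initializer. When constructing a new instance with the new keyword, an initializer lets you set some of its fields: the left side of each equals sign is a field name, and the right side is a literal, a variable name, or any other expression. The compiler can tell apart tokens that look like the same name, so the assignments come out right. All right, I admit this is not a good habit; if you see it, please do not imitate it, and take it as a cautionary example...

Also, that pile of checks on currentByte inside the if was later refactored out into a separate MatchOpcode() method. Writing it the way it appears above is just too ugly... take that as a cautionary example as well.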
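The refactored MatchOpcode() method is not shown in the post. A minimal sketch of what such a helper could look like, assuming it does nothing more than test the range 0x80 to 0x88 that the original if chain enumerates (the class name OpcodeCheck is made up for this example):

```csharp
using System;

static class OpcodeCheck
{
    // Hypothetical reconstruction: true when the byte lies in [0x80, 0x88],
    // the same nine values the long if-chain compares against one by one.
    public static bool MatchOpcode(byte value)
    {
        return value >= 0x80 && value <= 0x88;
    }

    static void Main()
    {
        Console.WriteLine(MatchOpcode(0x81)); // True
        Console.WriteLine(MatchOpcode(0x7F)); // False
        Console.WriteLine(MatchOpcode(0x89)); // False
    }
}
```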

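Unremarkable as it is, here is the flow of this file:

1. Check that the file given as the argument exists and has the tkn extension. Exit if either check fails.

2. Obtain a Shift-JIS Encoding instance and a UTF-16LE Encoding instance, and use them to create a Shift-JIS input stream and a UTF-16LE output stream.

3. Verify the signature of the script file. The first 12 bytes are all assumed to be the signature.

4. After the signature checks out, write a byte order mark (BOM) to the output stream. This should not have to be done by hand, but I never figured out why the system does not do it for me even though I asked for a BOM when creating utf16le...

5. Create a queue that records the three most recent bytes. Use a variable (lastOpcode) to record the fourth most recent byte.

6. Scan the file until the end is reached. When three consecutive 00 bytes are followed by a byte in the range [0x80, 0x88], read a C string and write out a record.

7. The program ends.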
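On the BOM puzzle in step 4: as far as I know, BinaryWriter uses its Encoding only to convert chars and strings and never emits the encoding's preamble, whereas StreamWriter writes the preamble when it starts at the beginning of a stream. The sketch below compares the two on an in-memory stream; BomDemo and its method names are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Same setup as the article: UTF-16LE with the byteOrderMark flag set to true.
    static Encoding MakeUtf16Le()
    {
        return new UnicodeEncoding(false, true);
    }

    // BinaryWriter writes exactly the bytes it is given; no preamble is added.
    public static byte[] WriteWithBinaryWriter(string text)
    {
        Encoding utf16le = MakeUtf16Le();
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, utf16le))
            {
                writer.Write(utf16le.GetBytes(text));
            }
            return ms.ToArray(); // ToArray still works after the stream is closed
        }
    }

    // StreamWriter emits the encoding's preamble (FF FE here) before the text.
    public static byte[] WriteWithStreamWriter(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms, MakeUtf16Le()))
            {
                writer.Write(text);
            }
            return ms.ToArray();
        }
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(WriteWithBinaryWriter("A"))); // 41-00
        Console.WriteLine(BitConverter.ToString(WriteWithStreamWriter("A"))); // FF-FE-41-00
    }
}
```

So the byteOrderMark argument only controls what GetPreamble() returns; something still has to write those bytes, which is why the manual writer.Write( ( ushort ) 0xFEFF ) was needed with BinaryWriter.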

That gave me an updated set of records:

(Same format as before)


0x15 0xC 0x85 #
0x1C 0xC 0x81 base_path
0x2B 0xC 0x83 ../
0x34 0x16 0x85 #
0x3B 0x16 0x81 include
0x48 0x16 0x83 Script/BaseInstruction.txt
0x68 0x20 0x81 motion
0x74 0x20 0x81 Main

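Then it dawned on me: the "mysterious number" is the line number in the script source file! And the byte I had taken for an opcode or type specifies the type of the string that follows: a symbol, a decimal number, a hexadecimal number, an identifier, a string, and so on.

The number at offsets 0xC to 0xF of the script (the bytes 76 08 00 00, highlighted in purple in the original dump) still puzzled me for a while. I noticed that a total of 1237 "things" had been extracted from 0a69b4afebd6d64527a21e3f1aa993f9.tkn, while the mystery number is 0x876 = 2166, which is quite a bit more. Still, I felt the two had to be related. Then I remembered that I had used a rather poor method to extract the records: the condition only holds when there are three consecutive 00 bytes, and that stops being true as soon as the line number exceeds 0xFF = 255. I quickly revised the program into a third version that reads the "line number" and "type" fields according to this new understanding, and confirmed that the number is indeed the total number of tokens in the file.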
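The layout inferred above (8-byte "TOKENSET" tag, a 4-byte value 0x64, a 4-byte token count, then records of a 4-byte little-endian line number, a 1-byte type and a C string) can be checked against the first bytes of the dump. The sketch below is not the third version mentioned above, which the post does not show; TokenSetReader, Token and Sample are names made up for this example, and strings are decoded as ASCII only because the sample contains nothing else (real scripts would need Shift-JIS).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TokenSetReader
{
    public struct Token
    {
        public int Line;    // source line number
        public byte Type;   // 0x80..0x88 type tag
        public string Text; // token text without the trailing 00
    }

    // Offsets 0x00..0x2E of the dump shown in the article.
    public static readonly byte[] Sample = {
        0x54, 0x4F, 0x4B, 0x45, 0x4E, 0x53, 0x45, 0x54, 0x64, 0x00, 0x00, 0x00, 0x76, 0x08, 0x00, 0x00,
        0x0C, 0x00, 0x00, 0x00, 0x85, 0x23, 0x00, 0x0C, 0x00, 0x00, 0x00, 0x81, 0x62, 0x61, 0x73, 0x65,
        0x5F, 0x70, 0x61, 0x74, 0x68, 0x00, 0x0C, 0x00, 0x00, 0x00, 0x83, 0x2E, 0x2E, 0x2F, 0x00
    };

    // Reads the header and at most maxTokens records from data.
    public static List<Token> Read(byte[] data, int maxTokens, out int declaredCount)
    {
        var tokens = new List<Token>();
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            string magic = Encoding.ASCII.GetString(reader.ReadBytes(8));
            if (magic != "TOKENSET") throw new InvalidDataException("Wrong signature");
            reader.ReadInt32();                 // 0x64; the article treats it as part of the signature
            declaredCount = reader.ReadInt32(); // total number of tokens in the file
            for (int i = 0; i < maxTokens && i < declaredCount; i++)
            {
                var t = new Token();
                t.Line = reader.ReadInt32();
                t.Type = reader.ReadByte();
                var bytes = new List<byte>();
                byte b;
                while ((b = reader.ReadByte()) != 0) bytes.Add(b); // C string
                t.Text = Encoding.ASCII.GetString(bytes.ToArray());
                tokens.Add(t);
            }
        }
        return tokens;
    }

    static void Main()
    {
        int count;
        List<Token> tokens = Read(Sample, 3, out count);
        Console.WriteLine("declared token count: {0}", count); // 2166
        foreach (Token t in tokens)
            Console.WriteLine("line {0} type 0x{1:X2} {2}", t.Line, t.Type, t.Text);
    }
}
```

Because each record carries a full 4-byte line number, this way of reading does not depend on spotting three 00 bytes and keeps working past line 255.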

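Only then did I understand what TOKENSET in the signature means... This seemingly binary script has none of the soul of a compiled binary script.

A compiler front end has at least two stages, scanning and parsing. The scan stage performs lexical analysis and splits the source file into individual tokens; the parse stage performs syntax analysis, trying to "understand" those tokens according to a context-free grammar and building a syntax tree (and from it an abstract syntax tree). The scripts I was looking at, however, had only been scanned, and the scanner output had been saved directly as the "binary script". Truly OTL-worthy.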
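To illustrate what "only scanned" means, here is a toy scanner that turns one source line into (type, text) pairs using the type tags seen in the dump (0x80 decimal, 0x81 identifier, 0x83 string, 0x85 symbol). It is purely hypothetical: the engine's real lexical rules are unknown, hexadecimal literals (0x82) are not handled, and MiniScanner is a made-up name.

```csharp
using System;
using System.Collections.Generic;

static class MiniScanner
{
    // Splits one line into tokens; each entry pairs a type tag with the token text.
    public static List<KeyValuePair<byte, string>> Scan(string line)
    {
        var result = new List<KeyValuePair<byte, string>>();
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (char.IsWhiteSpace(c)) { i++; continue; } // whitespace separates tokens
            int start = i;
            if (c == '"')
            {
                // string literal: keep the text between the quotes
                int end = line.IndexOf('"', i + 1);
                if (end < 0) end = line.Length; // unterminated: take the rest of the line
                result.Add(new KeyValuePair<byte, string>(0x83, line.Substring(i + 1, end - i - 1)));
                i = Math.Min(end + 1, line.Length);
            }
            else if (char.IsLetter(c) || c == '_')
            {
                while (i < line.Length && (char.IsLetterOrDigit(line[i]) || line[i] == '_')) i++;
                result.Add(new KeyValuePair<byte, string>(0x81, line.Substring(start, i - start)));
            }
            else if (char.IsDigit(c))
            {
                while (i < line.Length && char.IsDigit(line[i])) i++;
                result.Add(new KeyValuePair<byte, string>(0x80, line.Substring(start, i - start)));
            }
            else
            {
                result.Add(new KeyValuePair<byte, string>(0x85, c.ToString())); // one-character symbol
                i++;
            }
        }
        return result;
    }

    static void Main()
    {
        foreach (var token in Scan("#base_path \"../\""))
            Console.WriteLine("0x{0:X2} {1}", token.Key, token.Value);
    }
}
```

Scanning the line #base_path "../" this way yields the same three (type, text) pairs that appear in the records extracted from the file, which is the sense in which the tkn file is just saved scanner output.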

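Put simply, this "binary script" preserves all of the text of the script source file and even adds extra information such as line numbers and types. The only thing missing is the comments, which were stripped out.

That makes things easy, doesn't it? So I wrote the so-called decompiler: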

ScriptDecompiler.cs


// ScriptDecompiler.cs, 2007/12/18
// by RednaxelaFX

/*
 * Copyright (c) 2007 RednaxelaFX. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *   * Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *   * Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *   * The name RednaxelaFX may not be used to endorse or promote products
 *     derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY RednaxelaFX "AS IS" AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
 * EVENT SHALL RednaxelaFX BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
 * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace FFDSystemAnalysis
{
    enum TokenType
    {
        Decimal = 0x080,
        Identifier = 0x081,
        Hexadecimal = 0x082,
        String = 0x083,
        Operator = 0x085
    }

    sealed class ScriptDecompiler
    {
        private static readonly byte[ ] SIGNATURE = {
            ( byte )0x54, ( byte )0x4F, ( byte )0x4B, ( byte )0x45,
            ( byte )0x4E, ( byte )0x53, ( byte )0x45, ( byte )0x54,
            ( byte )0x64, ( byte )0x0, ( byte )0x0, ( byte )0x0
        };

        static void Main( string[ ] args ) {
            if ( args.Length < 1 ) return;
            if ( !args[ 0 ].EndsWith( ".tkn" ) ) return;
            if ( !File.Exists( args[ 0 ] ) ) return;

            string infile = args[ 0 ];
            // note: the output file is created in the current directory
            string outfile = Path.GetFileNameWithoutExtension( infile ) + ".txt";

            Encoding utf16le = new UnicodeEncoding( false, true );
            Encoding jis = Encoding.GetEncoding( 932 );

            using ( BinaryReader reader = new BinaryReader( File.OpenRead( infile ), jis ) ) {
                using ( BinaryWriter writer = new BinaryWriter( File.Create( outfile ), utf16le ) ) {
                    byte[ ] sig = reader.ReadBytes( SIGNATURE.Length );
                    if ( !Equals( sig, SIGNATURE ) ) {
                        Console.WriteLine( "Wrong signature" );
                        return;
                    }

                    // write UTF-16LE BOM
                    writer.Write( ( ushort ) 0xFEFF );

                    // process each token
                    int lineNum = 1;
                    int lastLineNum = 1;
                    TokenType tokenType = TokenType.Operator;
                    TokenType lastTokenType = TokenType.Operator;
                    int tabCount = 0;
                    bool atLineStart = false; // true right after the indent of a new line was written
                    int tokenCount = reader.ReadInt32( );
                    for ( int tokenNum = 0; tokenNum < tokenCount; ++tokenNum ) {
                        // deal with line numbers, insert empty new lines if needed
                        lineNum = reader.ReadInt32( );
                        if ( lastLineNum < lineNum ) { // should write on a newline
                            // write empty lines
                            for ( int i = lastLineNum; i < lineNum; ++i ) {
                                writer.Write( utf16le.GetBytes( Environment.NewLine ) );
                            }
                            // write tabs as indent
                            for ( int tabs = 0; tabs < tabCount; ++tabs ) {
                                writer.Write( utf16le.GetBytes( "\t" ) );
                            }
                            // put a dummy value into lastTokenType so no leading space is written
                            lastTokenType = TokenType.Operator;
                            atLineStart = true;
                        }

                        // get token type
                        tokenType = ( TokenType ) reader.ReadByte( );

                        // get token value
                        string tokenString = ReadCString( reader );

                        // separate this token from a preceding identifier with a space
                        if ( !( lastTokenType == TokenType.Operator
                            || lastTokenType == TokenType.String
                            || lastTokenType == TokenType.Decimal
                            || lastTokenType == TokenType.Hexadecimal ) ) {
                            writer.Write( utf16le.GetBytes( " " ) );
                        }
                        switch ( tokenType ) {
                        case TokenType.Decimal:
                        case TokenType.Identifier:
                        case TokenType.Hexadecimal:
                            writer.Write( utf16le.GetBytes( tokenString ) );
                            break;

                        case TokenType.String:
                            writer.Write(
                                utf16le.GetBytes( string.Format( "\"{0}\"", tokenString ) ) );
                            break;

                        case TokenType.Operator:
                            switch ( tokenString ) {
                            case "{":
                                ++tabCount;
                                writer.Write( utf16le.GetBytes( tokenString ) );
                                break;
                            case "}":
                                // delete the last tab (2 bytes in UTF-16), but only if one
                                // was just written as indent for this line
                                if ( atLineStart && tabCount > 0 ) {
                                    writer.BaseStream.Position -= 2;
                                }
                                if ( tabCount > 0 ) --tabCount;
                                writer.Write( utf16le.GetBytes( tokenString ) );
                                break;
                            case "(":
                            case ",":
                            case ";":
                            case "=":
                                writer.Write(
                                    utf16le.GetBytes( string.Format( "{0} ", tokenString ) ) );
                                break;
                            case ")":
                                writer.Write(
                                    utf16le.GetBytes( string.Format( " {0}", tokenString ) ) );
                                break;
                            default:
                                // "#", "%", "-", "@" and any symbol not listed above are
                                // written unchanged so that no token is silently dropped
                                writer.Write( utf16le.GetBytes( tokenString ) );
                                break;
                            } // switch tokenString
                            break;

                        default:
                            Console.WriteLine( "Unexpected token type {0} at 0x{1}.",
                                tokenType.ToString( "X" ),
                                reader.BaseStream.Position.ToString( "X" ) );
                            return;
                        } // switch tokenType

                        // remember state for the next token
                        lastLineNum = lineNum;
                        lastTokenType = tokenType;
                        atLineStart = false;
                    } // for
                }
            }
        }

        // byte-wise array comparison
        static bool Equals( byte[ ] a, byte[ ] b ) {
            int len = a.Length;
            if ( len != b.Length ) return false;
            for ( int i = 0; i < len; i++ ) {
                if ( a[ i ] != b[ i ] ) return false;
            }
            return true;
        }

        // reads characters (decoded as Shift-JIS by the reader) up to the terminating '\0'
        static string ReadCString( BinaryReader reader ) {
            StringBuilder builder = new StringBuilder( );
            char c = '\0';

            while ( ( c = reader.ReadChar( ) ) != '\0' ) {
                builder.Append( c );
            }

            return builder.ToString( );
        }
    }
}

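Some of the code in the middle is there only to insert indentation; feel free to ignore that part...

The resulting script looks like this: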


1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.#base_path "../"
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.#include "Script/BaseInstruction.txt"
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.motion Main (  )

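Yes, there are a lot of blank lines in there. Those are places that originally held comments, or that were blank lines to begin with (to make the code look nicer). All I have done here is restore the original script's layout as faithfully as I can.

For now, this is good enough. This level of script processing already lets us do a lot. To go further, I could also implement the syntax analysis, which would make more detailed analysis of the scripts easier. But there is definitely no time for that in the next couple of days...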

Until then...

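P.S. All of the code above is released under the BSD license. Anyone interested is welcome to use it freely within the terms of that license.

P.S.S. There is plenty to poke fun at in the code above. For example, I did not use try-catch at all to handle exceptions that might occur, and in the first program I convert a Queue to an array just to compare it for equality (extremely ugly; writing my own circular array would have solved it). These are all so-called "prototype code", whose goal is to get something written as quickly as possible to check whether my guesses are right. Skipping exception handling and awkwardly using standard library containers instead of wrapping my own both come from the same motive. Dear readers, please bear with me on these points XD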
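As a sketch of how the Queue-to-array comparison could be avoided: instead of the circular array mentioned above, the version below swaps in an even simpler technique, a counter of consecutive 00 bytes, which answers the same question ("were the last three bytes all zero?") without any allocation. ZeroRunScanner and FindCandidates are hypothetical names, and unlike the original program this only reports the offsets of the candidate type bytes rather than reading the strings.

```csharp
using System;
using System.Collections.Generic;

static class ZeroRunScanner
{
    // Returns the offset of every byte in [0x80, 0x88] that directly
    // follows at least three consecutive 00 bytes.
    public static List<int> FindCandidates(byte[] data)
    {
        var hits = new List<int>();
        int zeroRun = 0; // number of consecutive 00 bytes immediately before data[i]
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            if (zeroRun >= 3 && b >= 0x80 && b <= 0x88) hits.Add(i);
            zeroRun = (b == 0) ? zeroRun + 1 : 0;
        }
        return hits;
    }

    static void Main()
    {
        // Offsets 0x10..0x1C of the dump: 0C 00 00 00 85 '#' 00 0C 00 00 00 81 'b'
        byte[] sample = { 0x0C, 0x00, 0x00, 0x00, 0x85, 0x23, 0x00, 0x0C, 0x00, 0x00, 0x00, 0x81, 0x62 };
        foreach (int offset in FindCandidates(sample))
            Console.WriteLine("candidate type byte at +0x{0:X}", offset);
    }
}
```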

P.S.S.S. Sigh, I really did overdo the laziness, though. In the second program I did not even switch from BinaryWriter back to StreamWriter...