在我看來,程序員的個人修養的成長分為三個階段。第一個階段是懵懂階段,在這個階段不知道代碼為什麼要這麼寫,只知道這樣寫可以運行。第二個階段是求知階段,這個階段是想知道為什麼,然後搜索翻看資料,看看別人說這個是為什麼。第三個階段是探索,當不知道為什麼的時候,運用自己的技術理論和事件去弄明白為什麼。
網絡上也有很多關於字符串常量的討論和文章,這些文章幾乎都是從理論知識或者是“一種顯然是這樣”的角度來講解C語言字符串常量,而且裡面也有一些看似合理但是錯誤的言論。今天我打算在這裡跟大家分享下我對於C語言字符串常量以及C語言字符串常量的存儲的理解。
首先我們來看看C語言標准《C99》中對於字符串常量的定義:
Description
A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz". A wide string literal is the same, except prefixed by the letter L.
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Semantics
In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and wide string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens are wide string literal tokens, the resulting multibyte character sequence is treated as a wide string literal; otherwise, it is treated as a character string literal.
In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals.66) The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence; for wide string literals, the array elements have type wchar_t, and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by the mbstowcs function with an implementation-defined current locale. The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
EXAMPLE This pair of adjacent character string literals
"\x12" "3"
produces a single character string literal containing the two characters whose values are '\x12' and '3', because escape sequences are converted into single members of the execution character set just prior to adjacent string literal concatenation.
上面這段關於C語言字符串常量的表述,總結下,主要說了一下幾點:
1.字符串常量使用雙引號引起來的。
2.字符串常量的類型是char類型的數組char []。
3.字符串常量是以0(0在雙引號中轉義之後就是我們熟悉的\0)結尾。
4.修改字符串常量的行為是未定義的。
其中字符串常量是char []類型的,而不是const char [],原因很簡單,因為K&R C中麼有const,const是ANSI C增加的,請參考《http://www.math.pku.edu.cn/teachers/qiuzy/c/ansi-kr-c.htm》。
對常量字符串的修改是未定義的,在操作系統和編譯器實現的過程中,往往是會報沒有權限的錯誤。對於這一點,我們可以看看下面的代碼:
1 void print(char*)
2 {
3 // ...
4 }
5
6 while (1)
7 {
8 print("Hello");
9 }
如果在"Hello"在print內部被更改,我們傳進去的是"Hello",它在內存的內容已經不是"Hello"了,而是"I am fine, and you?"會不會很奇怪?對於這一點,不知道Ritchie在設計C語言的時候究竟是怎麼想的。
那麼C語言的字符串常量究竟是如何存儲的呢?
我們再來看一個Demo:
1 // main.c
2 #include <unistd.h>
3 #include <stdio.h>
4 #include <stdlib.h>
5 #include <string.h>
6
7 static char* sayHi = "Hi Fene!";
8 static char* sayHello = "Hello Fene!";
9 static char* dummyHi = "Hi Fene!";
10
11 void print()
12 {
13 printf("sayHi:%s, sayHello:%s, dummyHi:%s\n", sayHi, sayHello, dummyHi);
14 }
15
16 int main()
17 {
18 print();
19 return 0;
20 }
編譯命令如下:
gcc -m32 -Wall main.c
我們使用objdump工具
objdump -s a.out
dump出a.out的內容如下:
0000043c <increase>:
Contents of section .rodata:
80484f8 03000000 01000200 48692046 656e6521 ........Hi Fene!
8048508 0048656c 6c6f2046 656e6521 00000000 .Hello Fene!....
8048518 73617948 693a2573 2c207361 7948656c sayHi:%s, sayHel
8048528 6c6f3a25 732c2064 756d6d79 48693a25 lo:%s, dummyHi:%
8048538 730a00 s..
其中第一列是對應的內存單元地址,後面緊跟著的是幾列是內存中內容的16進制數據。
你會發現,剛剛Demo出現的所有字符串都被存放在.rodata區域,也就是只讀區。每個字符串常量後面都跟了一個0。.rodata數據是在ELF文件被載入的時候,在內存中分配的內存,其生命周期同進程的生命周期。
另外,還有一點,值得注意的是,sayHi和dummyHi所指向的內容都是"Hi Fene!",但是你發發現在.rodata出現了2分"Hi Fene!",也就意味著sayHi和dummyHi指針的值是不一樣了,他們指向了不同的內存單元地址,我在Linux X86/X64/ARM上都有驗證過。這也就是為什麼不能通過判斷字符串常量的指針來判斷2個字符串是否相等的原因。但是在我看到的一些關於C語言字符串常量的資料中,有講到,為了節省內存,代碼中相同的字符串常量都指向同一個字符串常量,這一點我覺得有待於考證,至少Linux/X86/X64/ARM上不是這樣的。