Group commit and real fsync
Over the past few months I've seen a few cases of customers upgrading to MySQL 5.0 and hitting serious performance slowdowns, up to 10 times in certain cases. What surprised them most was that the problem was hardware and even OS specific: it could show up with one OS version but not with another. Even more interesting, performance may be dramatically affected by the --log-bin setting, which usually has just a couple of percent of overhead. So what is going on?
Actually we're looking at two issues here, which interleave in a funny way:
1. Group commit is broken in MySQL 5.0 if binary logging is enabled (as it enables XA).
2. Certain OS/hardware configurations still fake fsync(), delivering great performance at the cost of not being ACID.
The first one can be tracked by this bug. In a nutshell, the problem is that a new feature, XA, was implemented in MySQL 5.0, and it did not work with the former group commit code. New group commit code, however, was never implemented. XA allows different transactional storage engines to be kept in sync with each other and with the binary log. XA is enabled whenever the binary log is enabled, which is why this issue is triggered by enabling the binary log. If the binary log is disabled, so is XA, and the old group commit code works just fine.
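To see why losing group commit hurts so much, here is a rough toy model in Python. It is only a sketch under simplifying assumptions of mine: each fsync() makes exactly one batch of commits durable, the function name and the figure of 16 concurrent transactions are made up for illustration, and the extra syncs XA itself may add are ignored. It is not how the server actually schedules its log writes.

```python
def commits_per_second(fsyncs_per_second, concurrent_transactions, group_commit):
    """Toy model: every fsync() makes one batch of commits durable.

    With group commit, all transactions waiting at that moment share the
    same fsync(). Without it, every commit pays for its own fsync().
    """
    batch = concurrent_transactions if group_commit else 1
    return fsyncs_per_second * batch


if __name__ == "__main__":
    # 100/sec stands in for a real fsync(), 1000/sec for a fake one
    # (both are illustrative figures, not measurements).
    for rate in (100, 1000):
        for group_commit in (True, False):
            tps = commits_per_second(rate, 16, group_commit)
            print(f"fsync rate {rate}/s, group commit {group_commit}: "
                  f"up to {tps} commits/s")
```

In this model the commit rate without group commit is capped at the fsync() rate itself, so the slower the fsync() is, the lower that cap sits.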
The second one is interesting. Actually, we would hear many more people screaming about this problem if the OS were honest with us. Happily for us, many OS/hardware pairs are still lying about fsync(). The fsync() call is supposed to place data on the disk securely, which, unless you have a battery-backed cache, gives you only 80-200 sequential fsync() calls per second, depending on your hard drive speed. With a fake fsync() the data is only written to the drive's memory, so it can be lost if the power goes down. However, it gives a great performance improvement, and you might see 1000+ fsync() calls per second. So if your OS is not giving you a real fsync you might not notice this bug. The performance degradation will still happen, but it will be much smaller, especially with large transactions.
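A back-of-the-envelope way to see where a figure of that order comes from: if an honest fsync() of the same page has to wait for the platter to come around again, the rate cannot exceed one call per revolution. The sketch below assumes exactly that and nothing else (no seek time, no write cache); the function name is mine, and the result is a rough upper bound, not a measurement.

```python
def max_sequential_fsyncs_per_second(rpm):
    """Upper bound assuming one durable rewrite of a page per revolution."""
    return rpm / 60.0


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000):
        bound = max_sequential_fsyncs_per_second(rpm)
        print(f"{rpm} RPM drive: at most about {bound:.0f} fsync() calls/sec")
```

If a plain drive without a battery-backed cache reports a rate far above this bound, something between the application and the platter is most likely caching the write.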
So how can you solve the problem?
1. Disable the binary log. This could be an option, for example, for slaves which do not need point-in-time recovery etc.
2. Check whether your OS is doing a real fsync. You should know this anyway if you care about your data safety. This can be done, for example, by using SysBench: sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr. This will write and fsync the same page, and you should see how many requests/sec it is doing. You also might want to check diskTest from this page, http://www.faemalia.Net/MySQLUtils/, which does some extra tests for fsync() correctness.
3. Install RAID with a battery-backed cache. This gives about the same effect as a fake fsync(), but you can make it secure (however, make sure your drives are not caching data by themselves). The good thing is that RAID controllers with battery-backed cache are becoming really inexpensive.
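If SysBench is not at hand, the same kind of check can be approximated with a few lines of Python. This is a minimal sketch, not a replacement for SysBench or diskTest: the function name, the two-second duration and the choice of a temporary file are my assumptions, and it only measures the rate, it does not prove that the data really reached the platter.

```python
import os
import tempfile
import time

PAGE_SIZE = 16384  # same size as the file used in the SysBench command


def measure_fsync_rate(directory=None, seconds=2.0):
    """Rewrite one page and fsync() it in a loop; return calls per second.

    directory: where to create the test file. Pass a directory on the disk
    you want to test; the default (system temp dir) may be a RAM-backed
    filesystem, which would make the result meaningless.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        page = b"x" * PAGE_SIZE
        os.write(fd, page)
        os.fsync(fd)  # allocate the page before timing starts
        count = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            os.lseek(fd, 0, os.SEEK_SET)  # always rewrite the same page
            os.write(fd, page)
            os.fsync(fd)
            count += 1
        elapsed = time.monotonic() - start
        return count / elapsed
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    rate = measure_fsync_rate()
    print(f"{rate:.0f} fsync() calls per second")
```

A result in the 80-200 range on a plain hard drive suggests fsync() is real; 1000+ without a battery-backed cache suggests it is being faked somewhere along the way.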
You probably also want to know whether this bug is going to be fixed. I'm not an authority on this question, but as Heikki describes it as a fundamental task, I'm not sure it will be done in 5.0. It would be good if it is done in 5.1.