摘要:MySQL支持的字符集简介
从根源上讲,计算机只能识别二进制代码,我们利用计算机处理的所有事情,最终都转换成了二进制码。为了使计算机能够处理文字,人们想出了给每个文字符号编码以便于计算机识别处理的办法,这就是字符集的由来。
简单的说,字符集就是一套文字符号及其编码、比较规则的集合。其从20世纪60年代发展到现在,已经非常成熟和稳定,下面是一些常见的字符集。
字符集 | 是否定长 | 编码方式 | 说明 |
---|---|---|---|
ACSII | 是 | 单字节7位编码 | 最早的奠基性字符集 |
ISO-8859-1/latin1 | 是 | 单字节8位编码 | 西欧字符集,latin1是ISO-8859-1的别名,有些环境下写作latin-1 |
GB 2312 | 是 | 双字节编码 | 全名是GB 2312-1980,早期国标,不再推荐使用 |
GBK | 是 | 双字节编码 | GBK是GB2312的扩展,非国标,但使用广泛 |
GB 18030 | 否 | 2字节或4字节编码 | 国标,是GBK的超集,并与其兼容 |
UTF-32 | 是 | 4字节编码 | 也叫UCS-4,使用较少 |
UCS-2 | 是 | 2字节编码 | |
UTF-16 | 否 | 2字节或4字节编码 | 不兼容ASCII |
UTF-8 | 否 | 1~4字节编码 | 被广泛使用的Unicode字符集 |
字符集对数据库来说非常重要,因为数据库存储的大部分都是各种文字,字符集对数据的存储、处理性能、甚至迁移升级等都有影响。
MySQL 5.7目前支持几十种字符集。
root@database-one 13:40: [(none)]> show character set; +----------+---------------------------------+---------------------+--------+ | Charset | Description | Default collation | Maxlen | +----------+---------------------------------+---------------------+--------+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 | | dec8 | DEC West European | dec8_swedish_ci | 1 | | cp850 | DOS West European | cp850_general_ci | 1 | | hp8 | HP West European | hp8_english_ci | 1 | | koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 | | latin1 | cp1252 West European | latin1_swedish_ci | 1 | | latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 | | swe7 | 7bit Swedish | swe7_swedish_ci | 1 | | ascii | US ASCII | ascii_general_ci | 1 | | ujis | EUC-JP Japanese | ujis_japanese_ci | 3 | | sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 | | hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 | | tis620 | TIS620 Thai | tis620_thai_ci | 1 | | euckr | EUC-KR Korean | euckr_korean_ci | 2 | | koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 | | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 | | greek | ISO 8859-7 Greek | greek_general_ci | 1 | | cp1250 | Windows Central European | cp1250_general_ci | 1 | | gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 | | latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 | | armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 | | utf8 | UTF-8 Unicode | utf8_general_ci | 3 | | ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 | | cp866 | DOS Russian | cp866_general_ci | 1 | | keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 | | macce | Mac Central European | macce_general_ci | 1 | | macroman | Mac West European | macroman_general_ci | 1 | | cp852 | DOS Central European | cp852_general_ci | 1 | | latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 | | utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 | | cp1251 | Windows Cyrillic | cp1251_general_ci | 1 | | utf16 | UTF-16 Unicode | utf16_general_ci | 4 | | utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 | | cp1256 | Windows Arabic | cp1256_general_ci | 1 | | cp1257 | Windows Baltic | cp1257_general_ci | 1 | | utf32 | UTF-32 Unicode | utf32_general_ci | 4 | | binary | Binary pseudo charset | binary | 1 | | geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 | | cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 | | eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 | | gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | +----------+---------------------------------+---------------------+--------+ 41 rows in set (0.00 sec)
可以看到,每种字符集包括Charset(字符集)和collation(校对规则)两个概念。字符集决定MySQL存储字符串的方式,校对规则决定比较字符串的方式。字符集和校对规则是一对多的关系,每个字符集至少对应一个校对规则。
root@database-one 13:42: [(none)]> show collation like 'gbk%'; +----------------+---------+----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +----------------+---------+----+---------+----------+---------+ | gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | | gbk_bin | gbk | 87 | | Yes | 1 | +----------------+---------+----+---------+----------+---------+ 2 rows in set (0.01 sec) root@database-one 13:48: [(none)]> SHOW COLLATION WHERE Charset = 'latin1'; +-------------------+---------+----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +-------------------+---------+----+---------+----------+---------+ | latin1_german1_ci | latin1 | 5 | | Yes | 1 | | latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 | | latin1_danish_ci | latin1 | 15 | | Yes | 1 | | latin1_german2_ci | latin1 | 31 | | Yes | 2 | | latin1_bin | latin1 | 47 | | Yes | 1 | | latin1_general_ci | latin1 | 48 | | Yes | 1 | | latin1_general_cs | latin1 | 49 | | Yes | 1 | | latin1_spanish_ci | latin1 | 94 | | Yes | 1 | +-------------------+---------+----+---------+----------+---------+ 8 rows in set (0.00 sec)
可以看到gbk有两种校对规则,默认使用gbk_chinese_ci。latin1有八种校对规则,默认使用latin1_swedish_ci。
校对规则命名约定:以相关的字符集名开始,中间一般是语言名,以_ci结尾代表大小写不敏感,以_cs结尾代表大小写敏感,以_bin结尾代表基于字符编码值比较。
面对众多的字符集,我们该如何选择呢?一般是结合业务的特征来考虑,比如:业务要给不同语言的国家或地区使用,就应该选择Unicode字符集。业务只在某个区域用,就选相应的国家字符集,会更高效些。
© 2020, morinson. 版权所有. 欢迎转载,但请保留作者及出处。