Using and Storing Business Data Fingerprints (MD5)

Author: MarkZhu | Published 2017-01-16 00:33

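Getting started

Storage optimization, business requirements, or the need to reference or deduplicate data sometimes make it necessary to compute and store a fingerprint of business data.

This article looks at three aspects:
1. Choosing the fingerprint algorithm
2. Choosing the input data for the fingerprint
3. How to store the fingerprint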

Choosing the fingerprint algorithm

Theoretical likelihood of a hash collision, from most to least likely:
md5 > sha1 > shaXX

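I have no special insight here. Note, though, that MD5 does not hold up against deliberate attacks: practical collision attacks exist that let an attacker construct two different inputs with the same MD5, and precomputed lookup tables such as rainbow tables can reverse the hashes of common inputs. Accidental collisions are very unlikely, but if someone may craft collisions on purpose to cause trouble, MD5 is not safe.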
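
To make the size side of this trade-off concrete, here is a minimal Python sketch using the standard `hashlib` module. The helper name `fingerprint` and the sample key are assumptions made for illustration; they do not come from the original article.

```python
import hashlib

def fingerprint(data: bytes, algorithm: str = "md5") -> bytes:
    """Return the raw digest of data (hypothetical helper for illustration)."""
    return hashlib.new(algorithm, data).digest()

sample = b"order-1001"  # made-up business key
sizes = {name: len(fingerprint(sample, name)) for name in ("md5", "sha1", "sha256")}
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size * 8} bits)")
```

A longer digest costs more bytes per row. Of the three algorithms shown, sha256 is the only one without a publicly known practical collision attack, so it is the safer pick when inputs may come from an adversary.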

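Choosing the input for a business data fingerprint

When choosing the input formula for the MD5, prefer one that can be computed directly in SQL, so that values can be checked and data prepared online during back-office maintenance. One option is shown after the following notes.

When computing an MD5 over several concatenated fields, add a separator between them (consider why leaving it out does not work: without one, different field values can concatenate to the same string), and decide whether separator characters that occur inside field values need to be escaped.
Also pay attention to the charset of the database and table, because the same text encoded differently produces a different fingerprint.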

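-- t is a placeholder table name; '!' is the escape character.
-- Doubling '~' alone is ambiguous: ('a~','b') and ('a','~b') both give 'a~~~b'.
-- concat_ws skips NULL arguments, so map NULL to a marker first if fields are nullable.
select
  md5(concat_ws('~',
    replace(replace(field1,'!','!!'),'~','!~'),
    replace(replace(field2,'!','!!'),'~','!~')
  )) as fingerprint
from t;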
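
The same idea can be sketched on the application side in Python. This is a minimal illustration, not the author's implementation: the escape character `!`, the helper names `escape`, `fingerprint` and `naive`, and the sample values are all assumptions. It shows why a separator is needed, why separator characters inside fields should be escaped with a distinct escape character, and why the charset matters.

```python
import hashlib

SEP = "~"  # field separator, same character as in the SQL expression
ESC = "!"  # escape character (an assumption chosen for this sketch)

def escape(value: str) -> str:
    """Escape the escape character first, then the separator."""
    return value.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP)

def fingerprint(fields, charset="utf-8") -> str:
    """Hex MD5 over the escaped, separator-joined fields."""
    joined = SEP.join(escape(f) for f in fields)
    return hashlib.md5(joined.encode(charset)).hexdigest()

def naive(fields) -> str:
    """Plain concatenation without a separator, for comparison only."""
    return hashlib.md5("".join(fields).encode("utf-8")).hexdigest()

# Without a separator, ("1", "23") and ("12", "3") both hash "123".
print(naive(["1", "23"]) == naive(["12", "3"]))               # True
# With separator and escaping, field boundaries stay distinct.
print(fingerprint(["1", "23"]) == fingerprint(["12", "3"]))   # False
print(fingerprint(["a~", "b"]) == fingerprint(["a", "~b"]))   # False
# The same text in a different charset gives a different fingerprint.
print(fingerprint(["数据"], "utf-8") == fingerprint(["数据"], "gbk"))  # False
```

If fingerprints are computed both in the application and in SQL, both sides need the same separator, escape rule and charset, otherwise the values will not match.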

Making good use of the BINARY type in MySQL for performance:

Suppose the MD5 value has to be stored in a MySQL table. Three options come to mind:
varchar(32)
char(32)
binary(16)

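An MD5 value is really 128 bits (16 bytes) of binary data; the hex string is only a human-friendly representation. Storing it in MySQL as binary(16) therefore saves the most space compared with varchar and char. The same holds for sha1/shaX and other hash algorithms. It also reduces I/O and memory work, so in theory it improves performance as well, and the difference grows with multi-table joins and sorting.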

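A varchar(32) holding a hex MD5 occupies 32 bytes of data plus a 1-byte length prefix, about twice the size of binary(16). Under utf8 the gap can widen wherever MySQL reserves the maximum width, up to 32*3 bytes plus the length prefix, for example in MEMORY tables, which use fixed-length rows. In addition, when the column is indexed or compared, a varchar under a case-insensitive collation compares more slowly than raw binary bytes. So in theory binary(16) deserves preference.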
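
The sizes involved can be checked with a back-of-the-envelope Python sketch. It counts payload bytes only and ignores row and index overhead, which depend on the storage engine and row format; the variable names and the sample input are assumptions for illustration.

```python
import hashlib

digest = hashlib.md5(b"example").digest()  # raw 16-byte value
hex_text = digest.hex()                    # 32-character hex text

binary16 = len(digest)                         # BINARY(16) payload
varchar32 = len(hex_text.encode("utf-8")) + 1  # hex chars are ASCII, plus 1 length byte
utf8_max_width = 32 * 3 + 1                    # worst case when the maximum width is reserved

print(binary16, varchar32, utf8_max_width)     # 16 33 97
```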

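For manual selects, the HEX/UNHEX functions convert the binary data to and from a readable form (http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_hex).
If needed, MySQL 5.7+ can also expose a text version of the value through a generated (virtual) column.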
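
On the application side, the same conversion can be done before binding parameters. The following Python sketch mirrors UNHEX and HEX; the helper names `to_db` and `to_text` are hypothetical and chosen only for this example.

```python
import hashlib

def to_db(hex_text: str) -> bytes:
    """Like UNHEX(): 32 hex characters -> 16 raw bytes for a BINARY(16) column."""
    return bytes.fromhex(hex_text)

def to_text(raw: bytes) -> str:
    """Like HEX(): raw bytes -> uppercase hex text."""
    return raw.hex().upper()

hex_md5 = hashlib.md5(b"a").hexdigest()
raw = to_db(hex_md5)
print(hex_md5)       # 0cc175b9c0f1b6a831c399e269772661
print(len(raw))      # 16
print(to_text(raw))  # same digits, uppercase
```

Passing the raw bytes as a query parameter avoids calling UNHEX in every statement, while the hex form stays available for logs and manual checks.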

Appendix: References

When generating our hash, we'll put an unlikely separator between each column during concatenation so that we won't have "1" and "23" getting confused with "12" and "3" as in this example:
mysql> select concat(1,23), md5(concat(1,23));
+--------------+----------------------------------+
| concat(1,23) | md5(concat(1,23))                |
+--------------+----------------------------------+
| 123          | 202cb962ac59075b964b07152d234b70 |
+--------------+----------------------------------+
1 row in set (0.00 sec)

mysql> select concat(12,3), md5(concat(12,3));
+--------------+----------------------------------+
| concat(12,3) | md5(concat(12,3))                |
+--------------+----------------------------------+
| 123          | 202cb962ac59075b964b07152d234b70 |
+--------------+----------------------------------+
1 row in set (0.00 sec)

Instead, we'll do this:
mysql> select concat_ws('~',1,23), md5(concat_ws('~',1,23));
+---------------------+----------------------------------+
| concat_ws('~',1,23) | md5(concat_ws('~',1,23))         |
+---------------------+----------------------------------+
| 1~23                | 037ef90202e1e89a23016e4b51489326 |
+---------------------+----------------------------------+
1 row in set (0.00 sec)

mysql> select concat_ws('~',12,3), md5(concat_ws('~',12,3));
+---------------------+----------------------------------+
| concat_ws('~',12,3) | md5(concat_ws('~',12,3))         |
+---------------------+----------------------------------+
| 12~3                | 4ba8224d8a784c8af2af98b4ceb034c6 |
+---------------------+----------------------------------+
1 row in set (0.00 sec)

http://dev.mysql.com/doc/refman/5.7/en/char.html ：
In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.

http://dba.stackexchange.com/questions/2640/what-is-the-performance-impact-of-using-char-vs-varchar-on-a-fixed-size-field：

TRADEOFF #1 Obviously, VARCHAR holds the advantage since variable-length data would produce smaller rows and, thus, smaller physical files.

TRADEOFF #2 Since CHAR fields require less string manipulation because of fixed field widths, index lookups against CHAR field are on average 20% faster than that of VARCHAR fields. This is not any conjecture on my part. The book MySQL Database Design and Tuning performed something marvelous on a MyISAM table to prove this. The example in the book did something like the following:

http://stackoverflow.com/questions/59667/why-would-i-ever-pick-char-over-varchar-in-sql：
As was pointed out by Gaven in the comments, if you are using a multi-byte, variable length character set like UTF8 then CHAR stores the maximum number of bytes necessary to store the number of characters. So if UTF8 needs at most 3 bytes to store a character, then CHAR(6) will be fixed at 18 bytes, even if only storing latin1 characters. So in this case VARCHAR becomes a much better choice.

https://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/

select * from t where id = x'0cc175b9c0f1b6a831c399e269772661';

MySQL 5.7:
create table users(
  id_bin binary(16),
  id_text varchar(36) generated always as
    (insert(
      insert(
        insert(
          insert(hex(id_bin),9,0,'-'),
          14,0,'-'),
        19,0,'-'),
      24,0,'-')
    ) virtual,
  name varchar(200));

http://www.ovaistariq.net/632/understanding-mysql-binary-and-non-binary-string-data-types/:
A CHAR(10) column would need 30 bytes for each value regardless of the actual value if utf8 character set is used, however, the same column would need 10 bytes for each value if a single-byte character set such as latin1 is used. Keeping these considerations in mind is very important.

http://www.techearl.com/mysql/how-to-store-md5-hashes-in-a-mysql-database：
INSERT INTO md5_test_binary (md5) VALUES (unhex('0800fc577294c34e0b28ad2839435945'));
