can-php-detect-4-byte-encoded-ut

Author: 许一沐 | Published: 2022-07-11 15:03

https://stackoverflow.com/questions/16496554/can-php-detect-4-byte-encoded-utf8-chars

This should work:

if (max(array_map('ord', str_split($string))) >= 240)
The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are encoded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the lead byte has a value of 240 (0xF0) or higher. In valid UTF-8, any such byte in the string indicates a 4-byte sequence.
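As a minimal sketch, the same check can be wrapped in a function. The name `hasFourByteUtf8` is hypothetical (it is not part of the original answer), and the input is assumed to be valid UTF-8. The guard for the empty string is there because `str_split('')` returns an empty array as of PHP 8.2, and `max()` rejects an empty array:

```php
<?php
// Hypothetical helper, not from the original answer.
// Assumes $string is valid UTF-8; it only inspects raw byte values.
function hasFourByteUtf8(string $string): bool
{
    // Avoid calling max() on an empty array for an empty input.
    if ($string === '') {
        return false;
    }

    // A byte of 240 (0xF0) or higher can only be the lead byte
    // of a 4-byte sequence in valid UTF-8.
    return max(array_map('ord', str_split($string))) >= 240;
}

var_dump(hasFourByteUtf8('abc'));      // bool(false): 1-byte characters only
var_dump(hasFourByteUtf8('น้อยกว่า')); // bool(false): Thai uses 3-byte sequences
var_dump(hasFourByteUtf8('𥄫'));       // bool(true): U+2512B needs 4 bytes
```

Because it splits the string into single bytes, this approach allocates one array element per byte, so it may be worth avoiding on very large inputs.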

If you want to replace (or remove) such 4-byte characters, this will do:

$cc = '𝓘𝓼𝓵𝓲𝓪𝓷𝓪';
$replaceTo = '&';
// $cc = 'น้อยกว่า';
// $cc = '𥄫';


// '/./u' matches one UTF-8 character at a time; a match of 4 bytes is a 4-byte sequence.
$cc = preg_replace_callback('/./u', function (array $match) use ($replaceTo) {
    return strlen($match[0]) >= 4 ? $replaceTo : $match[0];
}, $cc);

echo $cc;

Though there may be a more elegant regex way to express high codepoints directly.
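One possible way to express the high code points directly is a character class covering U+10000 through U+10FFFF, which is exactly the range that UTF-8 encodes in 4 bytes. The sketch below is an assumption about what such a regex could look like, not part of the original answer; the constant name `ASTRAL_PATTERN` and the sample string are made up for illustration:

```php
<?php
// Hypothetical constant name. With the /u flag, \x{...} refers to a code point,
// so this class matches every character above U+FFFF.
const ASTRAL_PATTERN = '/[\x{10000}-\x{10FFFF}]/u';

$cc = 'a𥄫b';
$replaceTo = '&';

// Detection: preg_match() returns 1 when at least one such character is present.
$hasLong = preg_match(ASTRAL_PATTERN, $cc) === 1;

// Replacement: every matching character is swapped for $replaceTo.
// Note that with /u, preg_replace() returns null if $cc is not valid UTF-8.
$cleaned = preg_replace(ASTRAL_PATTERN, $replaceTo, $cc);

var_dump($hasLong);     // bool(true)
echo $cleaned, PHP_EOL; // a&b
```

Passing an empty string as `$replaceTo` removes the characters instead of replacing them.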
