Filtering Special Characters in Lua (UTF-8 Encoding)

Author: 张老虎 | Published: 2019-10-03 14:34


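Outline

Approach

Unicode ranges for Chinese characters

The relationship between Unicode and UTF-8

Common special characters

Filtering special characters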

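Approach

There are many kinds of special characters. After searching through a lot of material I could not find a definitive Unicode range for them, and even if one existed it would be hard to guarantee that it covered everything. So the problem is approached from the opposite direction: instead of listing what to remove, keep only the characters that the operating system accepts in file names.

Unicode Encoding of Chinese Characters

Excerpted from https://www.qqxiuzi.cn/zh/hanzi-unicode-bianma.php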

Character set                          Count   Unicode range
Basic Chinese characters, supplement   74      9FA6-9FEF
PUA supplement                         207     E600-E6CF

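Of these, only the basic Chinese character set (U+4E00 to U+9FA5, the range checked in the implementation) needs to be considered.

Getting the Unicode Code Point from a Character's UTF-8 Encoding

There is plenty of material online about how UTF-8 relates to Unicode, so it is not repeated here. In short, the Chinese characters considered here are all encoded in UTF-8 as three bytes, 1110xxxx 10xxxxxx 10xxxxxx. The 16 payload bits (4 + 6 + 6) hold exactly the two bytes of the Unicode code point, so extracting those 16 bits gives the character's code point.

Lua versions before 5.3 have no bitwise operators, so the masks and shifts are done with arithmetic. For a lead byte in the range 0xE0-0xEF, b1 % 0xe0 gives the same result as b1 & 0x0f (the low four bits); for a continuation byte in the range 0x80-0xBF, b2 % 0x80 gives the same result as b2 & 0x3f (the low six bits). Multiplying by 2 ^ 12 corresponds to a left shift by 12 bits, and so on.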

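-- The three bytes of the UTF-8 sequence that starts at curIndex
local b1 = string.byte(str, curIndex)
local b2 = string.byte(str, curIndex + 1)
local b3 = string.byte(str, curIndex + 2)
-- 4 bits from b1, 6 bits from b2, 6 bits from b3
local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)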
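
As a worked example of the arithmetic above, the following minimal sketch decodes one three-byte sequence. The helper name decode3 is hypothetical and not part of the original article. It assumes the sequence really is a three-byte one (lead byte in 0xE0-0xEF) and does no further validation. It uses the constants 4096 and 64 in place of 2 ^ 12 and 2 ^ 6, which is the same multiplication.

```lua
-- Hypothetical helper (not from the original article): decode the three-byte
-- UTF-8 sequence that starts at byte index i into a Unicode code point.
-- Assumes the lead byte is in 0xE0-0xEF; no other validation is done.
local function decode3(str, i)
  local b1, b2, b3 = string.byte(str, i, i + 2)
  if not (b1 and b2 and b3) then
    return nil -- the sequence is truncated
  end
  -- Multiplying by 4096 (2^12) and 64 (2^6) stands in for left shifts.
  return (b1 % 0xE0) * 4096 + (b2 % 0x80) * 64 + (b3 % 0x80)
end

-- "\228\184\173" is the byte sequence E4 B8 AD, the UTF-8 encoding of U+4E2D.
local cp = decode3("\228\184\173", 1)
print(string.format("U+%04X", cp)) -- prints U+4E2D
```

Step by step: 0xE4 % 0xE0 = 4, 0xB8 % 0x80 = 56, 0xAD % 0x80 = 45, and 4 * 4096 + 56 * 64 + 45 = 20013 = 0x4E2D, which falls inside the basic range 0x4E00-0x9FA5.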

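Special Characters to Filter Out

ASCII characters that Windows does not allow in file names, plus whitespace and "+", matched with the Lua pattern: [\\\\/:*?\"<>|%s+ ]
Characters whose UTF-8 encoding is two bytes long
Characters whose UTF-8 encoding is four bytes or longer

The special characters on this page can be used for testing: https://wenku.baidu.com/view/fddf6408844769eae009ed14.html?re=view

Implementation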
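
The categories above are told apart by the first byte of each character. The sketch below shows one way to map a lead byte to the length of its UTF-8 sequence. The function name utf8SeqLen is a hypothetical helper introduced for illustration, and the byte ranges are the ones the implementation branches on (with byte 0 treated as a single-byte character here).

```lua
-- Hypothetical helper (not from the original article): return the length in
-- bytes of the UTF-8 sequence that begins with leadByte, or nil when the
-- byte cannot start a sequence (a continuation byte or an invalid value).
local function utf8SeqLen(leadByte)
  if leadByte >= 0 and leadByte <= 127 then
    return 1 -- 0xxxxxxx: ASCII
  elseif leadByte >= 192 and leadByte <= 223 then
    return 2 -- 110xxxxx: two-byte sequence
  elseif leadByte >= 224 and leadByte <= 239 then
    return 3 -- 1110xxxx: three-byte sequence
  elseif leadByte >= 240 and leadByte <= 247 then
    return 4 -- 11110xxx: four-byte sequence
  end
  return nil -- 128-191 are continuation bytes; 248-255 are not valid here
end

print(utf8SeqLen(string.byte("a")))             -- 1
print(utf8SeqLen(string.byte("\228\184\173")))  -- 3 (lead byte 0xE4)
```

With this classification, two-byte and four-byte sequences can be skipped by advancing the index by the returned length, and only three-byte sequences need their code point inspected.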

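-- Filter out special characters, keeping ASCII and basic Chinese characters
function filterInvalidChars(str)
  local kept = {}
  local curIndex = 1
  local len = #str
  -- Check one character at a time; append it to kept if it is allowed.
  -- Looping while curIndex <= len also handles the empty string and
  -- makes sure the last byte of the string is examined.
  while curIndex <= len do
    local curByte = string.byte(str, curIndex)
    if curByte > 0 and curByte <= 127 then
      -- ASCII: keep for now, the forbidden ones are stripped at the end
      kept[#kept + 1] = string.sub(str, curIndex, curIndex)
      curIndex = curIndex + 1
    elseif curByte >= 192 and curByte <= 223 then
      -- Two-byte sequence: drop it
      curIndex = curIndex + 2
    elseif curByte >= 224 and curByte <= 239 then
      -- Three-byte sequence: keep it only if it is a basic Chinese character
      local b1 = curByte
      local b2 = string.byte(str, curIndex + 1)
      local b3 = string.byte(str, curIndex + 2)
      if b2 and b3 then -- skip a sequence truncated at the end of the string
        local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)
        if unic >= 0x4e00 and unic <= 0x9FA5 then
          kept[#kept + 1] = string.sub(str, curIndex, curIndex + 2)
        end
      end
      curIndex = curIndex + 3
    elseif curByte >= 240 and curByte <= 247 then
      -- Four-byte sequence: drop it
      curIndex = curIndex + 4
    else
      -- Not a valid lead byte; logger is expected to be provided by the host application
      logger:error('filter invalid chars error: '..str)
      return str
    end
  end
  -- The parentheses discard the match count that string.gsub also returns
  return (string.gsub(table.concat(kept), '[\\\\/:*?\"<>|%s+ ]', ''))
end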
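
The last step of the function, stripping the forbidden ASCII characters, can be tried on its own. The sketch below is a minimal illustration, with stripForbidden as a hypothetical name; it applies the same pattern to a plain ASCII string. Note that string.gsub returns two values (the result and the number of substitutions), so the call is wrapped in parentheses to keep only the string.

```lua
-- Hypothetical helper (not from the original article): remove the ASCII
-- characters the article's pattern matches (\ / : * ? " < > |, whitespace
-- and +). Inside a Lua character set, * ? and + are plain characters.
local function stripForbidden(s)
  -- Parentheses truncate gsub's two return values to the string alone.
  return (string.gsub(s, '[\\\\/:*?\"<>|%s+ ]', ''))
end

print(stripForbidden('a/b: c?+d.txt')) -- prints abcd.txt
```

Because whitespace and "+" are part of the set, a name such as "my file.txt" comes back as "myfile.txt"; whether that is desirable depends on the use case.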

Article link: https://www.haomeiwen.com/subject/aftluctx.html
