Matching CJK Characters with Regular Expressions: Chinese, Japanese, and Korean

Author: mudssky | Published: 2021-07-01 20:51

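    Chinese character ranges

    For details, see this chapter of the Unicode Standard on the official Unicode site:

    https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf

    Chinese characters span a wide set of ranges, and many Han characters are shared with Japanese and Korean (The Unicode Standard contains a set of unified Han ideographic characters used in the written Chinese, Japanese, and Korean languages).

    CJK stands for Chinese, Japanese, and Korean.

    The table below is copied from the official document's list of Han ideograph blocks.

    4E00–9FFF is the original CJK Unified Ideographs block. It covers most characters in common use, so matching against this range alone is usually enough for Han characters.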

    Block                                     Range          Comment
    CJK Unified Ideographs                    4E00–9FFF      Common
    CJK Unified Ideographs Extension A        3400–4DBF      Rare
    CJK Unified Ideographs Extension B        20000–2A6DF    Rare, historic
    CJK Unified Ideographs Extension C        2A700–2B73F    Rare, historic
    CJK Unified Ideographs Extension D        2B740–2B81F    Uncommon, some in current use
    CJK Unified Ideographs Extension E        2B820–2CEAF    Rare, historic
    CJK Unified Ideographs Extension F        2CEB0–2EBEF    Rare, historic
    CJK Unified Ideographs Extension G        30000–3134F    Rare, historic
    CJK Compatibility Ideographs              F900–FAFF      Duplicates, unifiable variants, corporate characters
    CJK Compatibility Ideographs Supplement   2F800–2FA1F    Unifiable variants
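
The ranges in the table can be turned into a character class. The following is a minimal Python sketch, not a definitive implementation: the names `HAN_BASIC`, `HAN_EXTENDED`, and `contains_han` are made up for illustration, and the extended pattern assumes the Unicode 13.0 block boundaries listed above.

```python
import re

# Basic block only: CJK Unified Ideographs (U+4E00-U+9FFF).
HAN_BASIC = re.compile("[\u4e00-\u9fff]")

# Wider pattern covering every block in the table above.
# Code points above U+FFFF need the 8-digit \U escape in Python.
HAN_EXTENDED = re.compile(
    "["
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\u3400-\u4dbf"          # Extension A
    "\U00020000-\U0002a6df"  # Extension B
    "\U0002a700-\U0002b73f"  # Extension C
    "\U0002b740-\U0002b81f"  # Extension D
    "\U0002b820-\U0002ceaf"  # Extension E
    "\U0002ceb0-\U0002ebef"  # Extension F
    "\U00030000-\U0003134f"  # Extension G
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U0002f800-\U0002fa1f"  # CJK Compatibility Ideographs Supplement
    "]"
)


def contains_han(text: str, extended: bool = False) -> bool:
    """Return True if text contains at least one Han ideograph."""
    pattern = HAN_EXTENDED if extended else HAN_BASIC
    return pattern.search(text) is not None
```

For example, `contains_han("漢字")` is true with either pattern, while a character from Extension B such as U+20000 is only found when `extended=True`.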

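    Japanese character ranges

    Many Japanese kanji share their Unicode code points with Chinese characters, so to decide whether text is Japanese it is better to look for hiragana and katakana.

    • Hiragana: 3040–309F
    • Katakana: 30A0–30FF
    • Katakana Phonetic Extensions: 31F0–31FF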
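
As a rough illustration of the kana-based approach, here is a small Python sketch. The names `KANA` and `looks_japanese` are hypothetical, and the check is only a heuristic: Japanese text written entirely in kanji would not be detected.

```python
import re

# Hiragana, Katakana, and Katakana Phonetic Extensions, as listed above.
KANA = re.compile("[\u3040-\u309f\u30a0-\u30ff\u31f0-\u31ff]")


def looks_japanese(text: str) -> bool:
    """Heuristic: text is treated as Japanese if it contains any kana."""
    return KANA.search(text) is not None
```

With this sketch, `looks_japanese("これは日本語です")` is true because of the hiragana, while `looks_japanese("这是中文")` is false.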

    Korean character ranges

    • Hangul Syllables: AC00–D7AF

    • Hangul Jamo: 1100–11FF

    • Hangul Compatibility Jamo: 3130–318F
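
The three Hangul ranges can be combined in the same way. This is a sketch under the assumption that these three blocks are sufficient for the text being checked; `HANGUL` and `contains_hangul` are illustrative names only.

```python
import re

# Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo, as listed above.
HANGUL = re.compile("[\uac00-\ud7af\u1100-\u11ff\u3130-\u318f]")


def contains_hangul(text: str) -> bool:
    """Return True if text contains at least one Hangul character."""
    return HANGUL.search(text) is not None
```

For example, `contains_hangul("한국어")` is true, while `contains_hangul("漢字")` is false.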

    Below are regular expressions for matching Japanese text that I found online, kept here for reference.

    
    Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf) ~ The Big Kahuna!
    ([一-龯])
    
    Regex for matching Hiragana or Katakana
    ([ぁ-んァ-ン])
    
    Regex for matching Non-Hiragana or Non-Katakana
    ([^ぁ-んァ-ン])
    
    Regex for matching Hiragana or Katakana or word characters (note: \w matches letters, digits, and underscore, not the punctuation 、。’)
    ([ぁ-んァ-ン\w])
    
    Regex for matching Hiragana or Katakana and random other characters
    ([ぁ-んァ-ン!:/])
    
    Regex for matching Hiragana
    ([ぁ-ん])
    
    Regex for matching full-width Katakana (zenkaku 全角)
    ([ァ-ン])
    
    Regex for matching half-width Katakana (hankaku 半角)
    ([ァ-ン゙゚])
    
    Regex for matching full-width Numbers (zenkaku 全角)
    ([０-９])
    
    Regex for matching full-width Letters (zenkaku 全角)
    ([Ａ-Ｚａ-ｚ])
    
    Regex for matching Hiragana codespace characters (includes non phonetic characters)
    ([ぁ-ゞ])
    
    Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
    ([ァ-ヶ])
    
    Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
    ([ヲ-゚])
    
    Regex for matching Japanese Post Codes
    /^\d{3}\-\d{4}$/
    /^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
    
    Regex for matching Japanese mobile phone numbers (keitai bangou)
    /^\d{3}-\d{4}-\d{4}$|^\d{11}$/
    /^0\d0-\d{4}-\d{4}$/
    
    Regex for matching Japanese fixed line phone numbers
    /^[0-9-]{6,9}$|^[0-9-]{12}$/
    /^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
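
As a usage sketch for the postal code case, here is one way to apply such a pattern in Python, assuming the intended format is three digits, a hyphen, and four digits. The names `POSTAL_CODE` and `is_postal_code` are hypothetical. In Python 3, `\d` on a `str` pattern also matches full-width digits such as １, so the sketch passes `re.ASCII` to restrict it to 0-9.

```python
import re

# Three digits, a hyphen, four digits. re.ASCII limits \d to the ASCII digits 0-9;
# without it, \d would also accept full-width digits.
POSTAL_CODE = re.compile(r"\d{3}-\d{4}", re.ASCII)


def is_postal_code(text: str) -> bool:
    """Return True if the whole string is a postal code like 100-0001."""
    # fullmatch anchors at both ends, so no explicit ^ or $ is needed.
    return POSTAL_CODE.fullmatch(text) is not None
```

Using `fullmatch` rather than `match` with `$` is a deliberate choice here: `$` would also accept a trailing newline, which is rarely wanted when validating form input.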
    
