![](https://img.haomeiwen.com/i3382609/251e0cc194a94324.png)
![](https://img.haomeiwen.com/i3382609/c4e1b58ee3a70008.png)
![](https://img.haomeiwen.com/i3382609/7b8ebcebad410e9e.png)
Tokenization is relatively easy for English, since words are separated by spaces or punctuation.
Other languages are not as convenient: in German, Japanese, and Chinese, word boundaries are often not marked by obvious spaces.
Japanese, in particular, uses no spaces at all.
Even so, humans have no trouble segmenting such text.
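To make this concrete, here is a minimal sketch using NLTK's `WhitespaceTokenizer` (the sample sentences are my own, not from the lecture): whitespace splitting works reasonably for English, but a Japanese sentence comes back as a single token.

```python
from nltk.tokenize import WhitespaceTokenizer  # pip install nltk; no corpora download needed

tokenizer = WhitespaceTokenizer()

# English: spaces already separate the words.
print(tokenizer.tokenize("This is Andrew's text, isn't it?"))
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

# Japanese: no spaces, so the whole sentence comes back as one "token".
print(tokenizer.tokenize("今日はいい天気です"))
# ['今日はいい天気です']
```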
![](https://img.haomeiwen.com/i3382609/ff993b676e104d8d.png)
Token: you can think of it as a useful unit for semantic processing.
![](https://img.haomeiwen.com/i3382609/a7da84b62fbd6837.png)
![](https://img.haomeiwen.com/i3382609/79895a4b72f868f5.png)
As you can see, these tokens don't make any sense on their own.
So tokenization needs to produce tokens that are meaningful.
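The following sketch illustrates the slide's point with NLTK's tokenizers (I'm assuming the lecture's example sentence here): splitting purely on punctuation shatters contractions into fragments like "isn" and "t", while the rule-based Treebank tokenizer keeps semantically useful units such as "'s" and "n't".

```python
from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

text = "This is Andrew's text, isn't it?"

# Splitting on punctuation produces pieces that carry no meaning by themselves.
print(WordPunctTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

# A rule-based tokenizer keeps meaningful units like "'s" and "n't".
print(TreebankWordTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
```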
![](https://img.haomeiwen.com/i3382609/342e2566666c9129.png)
![](https://img.haomeiwen.com/i3382609/700d9cfd1566021d.png)
![](https://img.haomeiwen.com/i3382609/e83dfa3f17a4c306.png)
![](https://img.haomeiwen.com/i3382609/713341d74fb69ea7.png)
![](https://img.haomeiwen.com/i3382609/dc52c5b0614533a2.png)
![](https://img.haomeiwen.com/i3382609/037e23ab879b059f.png)
![](https://img.haomeiwen.com/i3382609/8ebd6355158dcb94.png)
(Source: https://www.coursera.org/learn/language-processing/lecture/SCd4G/text-preprocessing)