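Determining the encoding of a shapefile
First, a word on what a shapefile consists of. A shapefile is usually made up of four files: shp, dbf, shx, and prj. The prj file holds the spatial reference information; the shx file stores an index that helps programs search faster; the shp file holds the geometries of the features; and the dbf file holds the attribute data for each geometry. So when we talk about the encoding of a shapefile, we really mean the encoding of the dbf file, because the dbf is the only part that can contain text, whether that is Chinese in GBK or another language in some other encoding.
Data produced with ArcGIS is GBK-encoded by default, because our Windows systems (Simplified Chinese) default to GBK; in some cases the data may be UTF-8 instead.
The 30th byte of the dbf file (offset 29 when counting from zero) indicates the encoding. This is not absolute, but I have tried many programs and most of them follow this convention. So all we need to do is read the 30th byte of the dbf and look it up in a dbf code page table to get the encoding of the shapefile.
The code is simple: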
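As a small illustration of the file layout described above, the sketch below derives the four sibling paths from the path to any one of them. The class name `ShapefileParts` and the method `siblings` are hypothetical names chosen for this example, and it assumes that all four files share one base name and use lower-case extensions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any GIS library.
final class ShapefileParts {
    // The four extensions named in the text, in a fixed order.
    private static final String[] EXTENSIONS = {"shp", "shx", "dbf", "prj"};

    /** Given the path to any one component, returns extension -> sibling path. */
    static Map<String, Path> siblings(Path anyPart) {
        String name = anyPart.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Strip the extension, if there is one, to get the shared base name.
        String base = dot < 0 ? name : name.substring(0, dot);
        Map<String, Path> parts = new LinkedHashMap<>();
        for (String ext : EXTENSIONS) {
            parts.put(ext, anyPart.resolveSibling(base + "." + ext));
        }
        return parts;
    }

    public static void main(String[] args) {
        // The dbf entry is the file whose encoding matters.
        siblings(Paths.get("point", "point.shp"))
                .forEach((ext, path) -> System.out.println(ext + " -> " + path));
    }
}
```

Only the `dbf` entry matters for the encoding question; the other three are listed just to show how the pieces sit side by side.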
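package cn.dev;
import java.io.DataInputStream;
import java.io.InputStream;
public class Learn09 {
    public static void main(String[] args) throws Exception {
        // Load the .dbf from the classpath; try-with-resources closes the stream.
        try (InputStream dbf = Learn09.class.getResourceAsStream("/point/point.dbf")) {
            if (dbf == null) {
                throw new IllegalStateException("/point/point.dbf not found on the classpath");
            }
            byte[] bytes = new byte[30];
            // readFully fills all 30 bytes; a single read() call may return fewer.
            new DataInputStream(dbf).readFully(bytes);
            // The 30th byte (index 29) is the encoding marker.
            byte b = bytes[29];
            // Print it as an unsigned hex value so it can be looked up in the table.
            System.out.println(Integer.toHexString(Byte.toUnsignedInt(b)));
        }
    }
}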
For this shapefile, the output is
4d
Looking up 4d in the table below, we can see that it corresponds to GBK.
ID | Hex | Codepage | Description |
---|---|---|---|
1 | 0x01 | 437 | US MS-DOS |
2 | 0x02 | 850 | International MS-DOS |
3 | 0x03 | 1252 | Windows ANSI Latin I |
4 | 0x04 | 10000 | Standard Macintosh |
8 | 0x08 | 865 | Danish OEM |
9 | 0x09 | 437 | Dutch OEM |
10 | 0x0A | 850 | Dutch OEM* |
11 | 0x0B | 437 | Finnish OEM |
13 | 0x0D | 437 | French OEM |
14 | 0x0E | 850 | French OEM* |
15 | 0x0F | 437 | German OEM |
16 | 0x10 | 850 | German OEM* |
17 | 0x11 | 437 | Italian OEM |
18 | 0x12 | 850 | Italian OEM* |
19 | 0x13 | 932 | Japanese Shift-JIS |
20 | 0x14 | 850 | Spanish OEM* |
21 | 0x15 | 437 | Swedish OEM |
22 | 0x16 | 850 | Swedish OEM* |
23 | 0x17 | 865 | Norwegian OEM |
24 | 0x18 | 437 | Spanish OEM |
25 | 0x19 | 437 | English OEM (Great Britain) |
26 | 0x1A | 850 | English OEM (Great Britain)* |
27 | 0x1B | 437 | English OEM (US) |
28 | 0x1C | 863 | French OEM (Canada) |
29 | 0x1D | 850 | French OEM* |
31 | 0x1F | 852 | Czech OEM |
34 | 0x22 | 852 | Hungarian OEM |
35 | 0x23 | 852 | Polish OEM |
36 | 0x24 | 860 | Portuguese OEM |
37 | 0x25 | 850 | Portuguese OEM* |
38 | 0x26 | 866 | Russian OEM |
55 | 0x37 | 850 | English OEM (US)* |
64 | 0x40 | 852 | Romanian OEM |
77 | 0x4D | 936 | Chinese GBK (PRC) |
78 | 0x4E | 949 | Korean (ANSI/OEM) |
79 | 0x4F | 950 | Chinese Big5 (Taiwan) |
80 | 0x50 | 874 | Thai (ANSI/OEM) |
87 | 0x57 | Current ANSI CP | ANSI |
88 | 0x58 | 1252 | Western European ANSI |
89 | 0x59 | 1252 | Spanish ANSI |
100 | 0x64 | 852 | Eastern European MS-DOS |
101 | 0x65 | 866 | Russian MS-DOS |
102 | 0x66 | 865 | Nordic MS-DOS |
103 | 0x67 | 861 | Icelandic MS-DOS |
104 | 0x68 | 895 | Kamenicky (Czech) MS-DOS |
105 | 0x69 | 620 | Mazovia (Polish) MS-DOS |
106 | 0x6A | 737 | Greek MS-DOS (437G) |
107 | 0x6B | 857 | Turkish MS-DOS |
108 | 0x6C | 863 | French-Canadian MS-DOS |
120 | 0x78 | 950 | Taiwan Big 5 |
121 | 0x79 | 949 | Hangul (Wansung) |
122 | 0x7A | 936 | PRC GBK |
123 | 0x7B | 932 | Japanese Shift-JIS |
124 | 0x7C | 874 | Thai Windows/MS-DOS |
134 | 0x86 | 737 | Greek OEM |
135 | 0x87 | 852 | Slovenian OEM |
136 | 0x88 | 857 | Turkish OEM |
150 | 0x96 | 10007 | Russian Macintosh |
151 | 0x97 | 10029 | Eastern European Macintosh |
152 | 0x98 | 10006 | Greek Macintosh |
200 | 0xC8 | 1250 | Eastern European Windows |
201 | 0xC9 | 1251 | Russian Windows |
202 | 0xCA | 1254 | Turkish Windows |
203 | 0xCB | 1253 | Greek Windows |
204 | 0xCC | 1257 | Baltic Windows |
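The lookup can also be done in code. The following is a minimal sketch, assuming only a handful of rows from the table are needed; it reads the 30th byte from a stream and maps it to a Java charset. The class `DbfCharsetSketch`, its method names, and the choice of Java charset names for each code page are assumptions made for this example, and the map is deliberately partial. The header bytes here are fabricated in memory rather than read from a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; the map covers only a few rows of the table.
final class DbfCharsetSketch {
    private static final Map<Integer, String> ID_TO_CHARSET = new HashMap<>();
    static {
        ID_TO_CHARSET.put(0x03, "windows-1252"); // code page 1252
        ID_TO_CHARSET.put(0x58, "windows-1252"); // code page 1252
        ID_TO_CHARSET.put(0x4D, "GBK");          // code page 936
        ID_TO_CHARSET.put(0x7A, "GBK");          // code page 936
        ID_TO_CHARSET.put(0xC8, "windows-1250"); // code page 1250
        ID_TO_CHARSET.put(0xC9, "windows-1251"); // code page 1251
    }

    /** Reads the first 30 bytes and returns the 30th as an unsigned value. */
    static int readEncodingId(InputStream dbf) throws IOException {
        byte[] header = new byte[30];
        new DataInputStream(dbf).readFully(header);
        return Byte.toUnsignedInt(header[29]);
    }

    /** Returns the charset for an ID, or empty if the ID is not in the partial map. */
    static Optional<Charset> charsetFor(int id) {
        return Optional.ofNullable(ID_TO_CHARSET.get(id)).map(Charset::forName);
    }

    public static void main(String[] args) throws IOException {
        // Fabricated header: all zeros except the encoding byte.
        byte[] fakeHeader = new byte[32];
        fakeHeader[29] = 0x4D;
        int id = readEncodingId(new ByteArrayInputStream(fakeHeader));
        Charset cs = charsetFor(id)
                .orElseThrow(() -> new IllegalStateException("ID not in the partial map"));
        // Four bytes that form two Chinese characters when decoded as GBK.
        byte[] sample = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(Integer.toHexString(id) + " -> " + cs + " -> " + new String(sample, cs));
    }
}
```

Returning an `Optional` keeps the "not absolute" caveat visible: an ID that is missing from the map, or a file whose writer ignored the convention, still has to be handled by the caller.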
The code for this section can be found at https://github.com/scially/GeosparkBook (Learn09.java).