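I. Advantages of SAX parsing
SAX parses XML as a stream: it reads the document sequentially and fires events as it goes, so memory use stays small. dom4j's usual approach reads the entire XML file into memory at once; when that runs out of memory, SAX parsing is the recommended alternative.
II. SAX parsing workflow
Given an XML file: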
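To make the streaming model concrete, here is a minimal sketch that counts elements with the JDK's built-in SAX parser. The class name `SaxCountDemo`, the method `countElements`, and the sample string are made up for illustration. The handler keeps only a counter, so memory use does not grow with the size of the document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original project
class SaxCountDemo {
    /** Counts start tags; only the counter is kept in memory. */
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Called once per start tag as the parser streams through the input
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<MOTree><MO><attr name=\"id\">1</attr></MO></MOTree>";
        // prints 3 (MOTree, MO, attr)
        System.out.println(countElements(xml));
    }
}
```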
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<attr name="Info">NO���///�///�/�/���</attr>
<MOTree>
  <MO className="valueX1" fdn="valueX2">
    <attr name="key1">value1</attr>
    <attr name="key2">value2</attr>
    <attr name="key3">value3</attr>
    <MO className="valueY1" fdn="valueY2">
      <attr name="keyA1">valueA1</attr>
      <attr name="keyA2">valueA2</attr>
      <attr name="keyA3">valueA3</attr>
      <attr name="id">1</attr>
    </MO>
    <MO className="valueZ1" fdn="valueZ2">
      <attr name="keyB1">valueB1</attr>
      <attr name="keyB2">valueB2</attr>
      <attr name="keyB3">valueB3</attr>
      <attr name="id">3</attr>
    </MO>
    <MO className="valueXX1" fdn="valueXX2">
      <attr name="keyC1">valueC1</attr>
      <attr name="keyC2">valueC2</attr>
      <attr name="keyC3">valueC3</attr>
      <attr name="id">2</attr>
    </MO>
  </MO>
</MOTree>
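The XML I ran into at work looks like the sample above. It contains many illegal characters, the DTD has to be ignored during parsing, and the files are large enough that read/write pressure on the server must be kept down (parsing them with DOM took the server down). So I parse the XML with SAX, and preprocess the file before parsing.
1. Handling illegal characters
The sample XML contains illegal characters. When they are not needed, strip them out so that reading the file does not throw an invalid-character exception such as 0xdd. The code is below (apologies if it is rough).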
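import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public static String XmlFileToStr(File xmlFile) {
    StringBuilder xmlString = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(xmlFile), StandardCharsets.UTF_8), 10 * 1024 * 1024)) {
        String temp;
        // readLine() returns null at end of file
        while ((temp = reader.readLine()) != null) {
            for (int i = 0; i < temp.length(); ) {
                // Work on code points so that characters above 0xFFFF are checked correctly
                int current = temp.codePointAt(i);
                i += Character.charCount(current);
                // Keep only characters that are legal in XML 1.0
                if ((current == 0x9) ||
                        // Line feed is 0x0A (0xA)
                        (current == 0xA) ||
                        (current == 0xD) ||
                        // Space is 0x20
                        ((current >= 0x20) && (current <= 0xD7FF)) ||
                        ((current >= 0xE000) && (current <= 0xFFFD)) ||
                        ((current >= 0x10000) && (current <= 0x10FFFF))) {
                    xmlString.appendCodePoint(current);
                }
            }
            // readLine() strips the line terminator, so put one back
            xmlString.append('\n');
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return xmlString.toString();
}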
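The ranges checked above match the legal-character definition in XML 1.0. A Java `char` is a 16-bit UTF-16 unit, so the 0x10000 to 0x10FFFF range can only match when the text is examined by code point. The following is a minimal sketch of the check in isolation; `XmlCharDemo`, `isLegalXmlChar`, and `stripIllegal` are hypothetical names, not part of the original code.

```java
// Hypothetical demo class, not part of the original project
class XmlCharDemo {
    /** True if the code point is a legal XML 1.0 character. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every illegal code point removed. */
    static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // codePoints() combines valid surrogate pairs into one code point;
        // an unpaired surrogate falls in 0xD800..0xDFFF and is dropped
        s.codePoints().filter(XmlCharDemo::isLegalXmlChar).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "NO\u0000\u0001///\u0008ok";
        // prints NO///ok
        System.out.println(stripIllegal(dirty));
    }
}
```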
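2. Create a bean class for the data to be parsed
This example extracts the id values from the XML.
A few Lombok annotations are used here; look them up if you are interested.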
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
/**
* @author Amenoaki
*/
@Setter
@Getter
@AllArgsConstructor
@NoArgsConstructor
public class CmsBean {
    private String id;
}
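3. Define a Handler class that extends DefaultHandler, so that we only implement the methods we need
Commonly used DefaultHandler methods
startDocument(): called when the start of the document is read; preprocessing can go here, e.g. initializing the bean or a container
endDocument(): called when the end of the document is read; do wrap-up work here
startElement(): fired at a start tag
endElement(): fired at an end tag
characters(): handles content read from the file, i.e. the text between tags. Note: spaces and tabs after a tag are reported here too, and the parser may deliver one run of text in several calls.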
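The callback order described above can be observed directly. The sketch below records each event into a list; `SaxEventLogDemo` and `logEvents` are hypothetical names for illustration. Text is buffered and only logged when the next tag arrives, since `characters()` is not guaranteed to deliver a run of text in one call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original project
class SaxEventLogDemo {
    /** Parses the XML and returns the callbacks in the order they fired. */
    static List<String> logEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flushText();
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one run of text, so buffer it
                text.append(ch, start, length);
            }

            // Logs the buffered text (including whitespace) as a single event
            private void flushText() {
                if (text.length() > 0) {
                    events.add("text:[" + text + "]");
                    text.setLength(0);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // The newline and spaces between <MO> and <attr> show up as a text event
        logEvents("<MO>\n  <attr name=\"id\">1</attr></MO>").forEach(System.out::println);
    }
}
```

This is why the handler in this section only collects text while it is inside an element it cares about: whitespace between tags would otherwise end up in the data.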
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.ArrayList;
import java.util.List;
/**
 * @author Amenoaki
 */
public class CmHandler extends DefaultHandler {
    // Name of the attr currently being parsed; null outside a tracked attr
    private String currentTag;
    private List<CmsBean> cmsData;
    private CmsBean cmBean;
    // Last stored id, used to skip repeated data (per handler instance, so not static)
    private String existDataOfId = "";

    public List<CmsBean> getCmsData() {
        return cmsData;
    }

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        cmsData = new ArrayList<>(10);
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        super.startElement(uri, localName, qName, attributes);
        // Look the attribute up by name rather than by position; this is also null-safe
        if ("attr".equals(qName) && "id".equals(attributes.getValue("name"))) {
            cmBean = new CmsBean();
            this.currentTag = "id";
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        // Only the end of a tracked attr completes a bean
        if ("attr".equals(qName) && this.currentTag != null && this.cmBean != null) {
            String id = this.cmBean.getId();
            // Store the bean only when its id is not a repeat of the previous one
            if (id != null && !id.equals(existDataOfId)) {
                existDataOfId = id;
                this.cmsData.add(this.cmBean);
            }
            this.cmBean = null;
        }
        this.currentTag = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        super.characters(ch, start, length);
        // Ignore text outside a tracked attr
        if (this.currentTag != null && this.cmBean != null) {
            // data is one chunk of the text between the tags
            String data = new String(ch, start, length);
            // Assign it to the CM bean
            setCm(this.currentTag, data);
        }
    }

    public void setCm(String tagName, String data) {
        // Add more cases here when several attribute values are needed
        switch (tagName) {
            case "id":
                // Append, because the parser may deliver the text in several chunks
                String previous = this.cmBean.getId();
                this.cmBean.setId(previous == null ? data : previous + data);
                break;
            default:
                break;
        }
    }
}
Now put our SAX parser to use
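import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;

public static List<CmsBean> xmlToMapBySAX(String xml) {
    List<CmsBean> cmsData = new ArrayList<>();
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        CmHandler cmHandler = new CmHandler();
        // Change the character encoding as needed; this sample uses UTF-8
        saxParser.parse(new InputSource(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))), cmHandler);
        cmsData = cmHandler.getCmsData();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return cmsData;
}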
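The sample file declares `<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">`, and a SAX parser will normally try to load that DTD even when it is not validating. One way to skip it is to override `resolveEntity` in the handler and hand the parser empty content in place of the DTD. This is a sketch of one possible approach, not necessarily what the original project did; `DtdSkipDemo` and `parseIgnoringDtd` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original project
class DtdSkipDemo {
    /** Parses the XML without opening the external DTD; returns the element count. */
    static int parseIgnoringDtd(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Supply empty content instead of letting the parser open the DTD file
                return new InputSource(new StringReader(""));
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<!DOCTYPE mdc SYSTEM \"MeasDataCollection.dtd\">"
                + "<MOTree><MO/></MOTree>";
        // The DTD file does not need to exist for this to succeed
        System.out.println(parseIgnoringDtd(xml));
    }
}
```

The same `resolveEntity` override could be added to `CmHandler`, since `SAXParser.parse` registers the handler as the entity resolver as well.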
4. Example of the calling method
public List<Map<String, Object>> parseFile(File file) {
    String xml = RanXmlParse.XmlFileToStr(file);
    List<Map<String, Object>> list = new ArrayList<>();
    List<CmsBean> cmsBeans = TreeUtil.xmlToMapBySAX(xml);
    for (CmsBean cmsDatum : cmsBeans) {
        Map<String, Object> map = new HashMap<>(16);
        // Adjust as needed when there are more attributes
        map.put("id", cmsDatum.getId());
        list.add(map);
    }
    return list;
}
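The flow above still reads the whole cleaned file into one String before parsing. If even that is too much memory, the same filtering can be applied on the fly while the parser reads. Below is a minimal sketch, assuming a hypothetical `XmlCharFilterReader`. For brevity it lets every UTF-16 surrogate unit through without checking that surrogates form valid pairs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, not part of the original project
class XmlCharFilterReader extends FilterReader {
    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // Simplification: surrogate units are passed through unchecked
    private static boolean isLegal(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip illegal characters until a legal one or end of stream
        while ((c = super.read()) != -1) {
            if (isLegal((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n;
        while ((n = super.read(buf, off, len)) != -1) {
            // Compact the legal characters to the front of the chunk
            int write = off;
            for (int i = off; i < off + n; i++) {
                if (isLegal(buf[i])) {
                    buf[write++] = buf[i];
                }
            }
            if (write > off) {
                return write - off;
            }
            // The whole chunk was illegal: read the next one
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        String dirty = "<MO><attr name=\"id\">1\u0000</attr></MO>";
        // For a real file, wrap an InputStreamReader over a FileInputStream instead
        Reader clean = new XmlCharFilterReader(new StringReader(dirty));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(clean), new DefaultHandler());
        System.out.println("parsed without an illegal-character error");
    }
}
```

With this approach the parser pulls filtered characters in buffer-sized chunks, so the full document never has to sit in memory as a String.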