Tuesday, August 14, 2007

Java CAPS: Processing Large XML Payloads Using a SAX Parser

Introduction


In Java CAPS the standard way to deal with XML files is to parse them through Object Type Definitions (OTDs). An OTD represents an XML file as a Java object: it provides marshal and unmarshal methods, plus setters and getters for each element of the XML document.
The OTD is a smart way to create a DOM tree in memory, starting from the XML source document. However, if the XML file is large, loading it entirely into memory through a DOM representation is generally not a great idea. In this case it is common to use a SAX parser, which lets you process the XML file as a stream of events instead of loading the entire document into memory. SAX parsing is of course easy to implement in Java CAPS, as this article will briefly show.
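
To make the streaming idea concrete, here is a minimal standalone sketch in plain JAXP, outside of Java CAPS. It counts the elements of a document while it streams by, keeping only a counter in memory. The class name SaxElementCounter and the method countElements are hypothetical, picked just for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Java CAPS: counts elements while streaming.
final class SaxElementCounter {

    // Returns the number of start tags seen in the stream.
    static int countElements(InputStream in) throws Exception {
        final int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++; // only a counter is kept, never the document itself
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>".getBytes("UTF-8");
        System.out.println(countElements(new ByteArrayInputStream(xml))); // prints 3
    }
}
```

The memory footprint here does not depend on the size of the input, which is exactly what you want for big files.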

Implementation


The implementation is straightforward: it is just plain Java code. In this example the eGate flow is triggered by an event in the form of a JMS message containing the filename. As the XML file we'd like to process is presumably large (otherwise why bother with SAX...), it probably resides on some filesystem, so in this case a BatchLocalFile (part of the optional Batch eWay) can be used to read it. You are not doing something as silly as sending multi-megabyte payloads through your JMS server, are you? As a general rule of thumb, it is wise to keep your JMS payloads below 1 MB, to avoid overloading your JMS server. As already explained in other posts, I think moving bigger payloads through JMS is a clear indicator of some flaw in your process design and, sooner or later, it will lead to trouble.
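
Just to pin down the rule of thumb, here is a tiny sketch of how a sender could decide between shipping the payload inline and shipping only the filename. Everything in it is an assumption made for illustration: the class JmsPayloadPolicy, the method sendByReference, the default file name orders.xml and the exact threshold are not part of Java CAPS.

```java
import java.io.File;

// Hypothetical helper illustrating the "pass a reference, not the payload" rule of thumb.
final class JmsPayloadPolicy {

    // 1 MB threshold taken from the rule of thumb above; tune it for your JMS server.
    static final long MAX_INLINE_BYTES = 1024L * 1024L;

    // True when the payload should stay on disk and only its filename should travel through JMS.
    static boolean sendByReference(long payloadSizeBytes) {
        return payloadSizeBytes >= MAX_INLINE_BYTES;
    }

    public static void main(String[] args) {
        // "orders.xml" is a made-up default; pass a real path as the first argument.
        File file = new File(args.length > 0 ? args[0] : "orders.xml");
        System.out.println(file.getName() + " by reference: " + sendByReference(file.length()));
    }
}
```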

Connectivity Map


The Connectivity Map (CM) for this example is simple:

The queIn channel receives triggering events for the svcSaxParser service, which makes use of a BatchLocalFile external application to read the file from disk. The JCD, as described below, is really trivial and logs some elements using the standard logger.

Java Collaboration Definition


The SAX parsing service is implemented through a JCD called jcdSaxParser. It receives the input JMS message containing the filename, opens the InputStream from disk and assigns it to the SAX parser. An inner class extending SAX's DefaultHandler, called (with some lack of imagination...) MyHandler, is defined and used to intercept SAX events:

package SamplesprjSAXJCD;

import java.io.InputStream;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class jcdSaxParser
{
    public com.stc.codegen.logger.Logger logger;
    public com.stc.codegen.alerter.Alerter alerter;
    public com.stc.codegen.util.CollaborationContext collabContext;
    public com.stc.codegen.util.TypeConverter typeConverter;

    public void receive( com.stc.connectors.jms.Message input, com.stc.eways.batchext.BatchLocal BatchLocalFile_1 )
        throws Throwable
    {
        try {
            // The JMS message carries only the name of the file to process
            BatchLocalFile_1.getConfiguration().setTargetDirectoryName( "D:\\Projects" );
            BatchLocalFile_1.getConfiguration().setTargetFileName( input.getTextMessage() );
            InputStream istream = BatchLocalFile_1.getClient().getInputStreamAdapter().requestInputStream();
            boolean parsed = false;
            try {
                // Create a handler to handle SAX events
                DefaultHandler handler = new MyHandler( logger );
                // Parse the stream
                parseXmlStream( istream, handler, false );
                parsed = true;
            } finally {
                // Always hand the stream back, even if parsing failed;
                // the flag reports whether the stream was consumed successfully
                BatchLocalFile_1.getClient().getInputStreamAdapter().releaseInputStream( parsed );
            }
        } catch ( Exception ex ) {
            logger.error( "@@@ ", ex );
        }
    }

    // Parses an XML stream using a SAX parser.
    public static void parseXmlStream( InputStream istream, DefaultHandler handler, boolean validating )
        throws Exception
    {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating( validating );
        factory.newSAXParser().parse( istream, handler );
    }

    // DefaultHandler contains no-op implementations for all SAX events.
    // This class should override methods to capture the events of interest.
    static class MyHandler extends DefaultHandler
    {
        private final com.stc.codegen.logger.Logger _logger;
        private final StringBuffer _buff = new StringBuffer( 1024 );

        public MyHandler( com.stc.codegen.logger.Logger logger )
        {
            _logger = logger;
        }

        // Called for every opening tag: record its names in the buffer
        public void startElement( String uri, String localName, String qName, Attributes attributes )
            throws SAXException
        {
            _buff.append( "startElement: uri=" ).append( uri ).append( ", localName=" ).append( localName ).append( ", qName=" ).append( qName ).append( "\n" );
        }

        // May be called several times for a single text node, so just append
        public void characters( char[] cbuf, int start, int len )
            throws SAXException
        {
            _buff.append( "Characters: " ).append( new String( cbuf, start, len ) );
        }

        // Called for every closing tag: log what was collected and reset the buffer
        public void endElement( String uri, String localName, String qName )
            throws SAXException
        {
            if (_buff.length() > 0) {
                _logger.info( "@@@ " + _buff.toString() );
                _buff.setLength( 0 );
            }
        }
    }
}

After creating a proper Deployment Profile you can run this flow by sending a JMS message containing the filename to the queIn queue (you can use the eManager for that). Then you just need to add some more useful functionality to the MyHandler class.
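
As an idea of what "more useful" could look like, below is a hedged sketch of a handler that extracts the text of every element with a given name. It is plain JAXP and runs outside Java CAPS; the class TextCollectingHandler, its collect method and the sample element names are all made up for this example. Nested elements with the same name are not handled.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the trimmed text of every element with a given qName.
final class TextCollectingHandler extends DefaultHandler {
    private final String targetQName;
    private final List<String> values = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    TextCollectingHandler(String targetQName) {
        this.targetQName = targetQName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals(targetQName)) {
            inTarget = true;
            text.setLength(0); // start a fresh value
        }
    }

    @Override
    public void characters(char[] cbuf, int start, int len) {
        // The parser may deliver one text node in several chunks, so append.
        if (inTarget) {
            text.append(cbuf, start, len);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTarget && qName.equals(targetQName)) {
            values.add(text.toString().trim());
            inTarget = false;
        }
    }

    List<String> getValues() {
        return values;
    }

    // Convenience method for trying the handler on an in-memory document.
    static List<String> collect(String xml, String qName) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler(qName);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return handler.getValues();
    }
}
```

For a really large file you would not keep the values in a list as done here. You would rather hand each value to the next step of your flow from inside endElement, so that memory usage stays flat.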

The source stream is obtained from the InputStreamAdapter of the BatchLocalFile:
InputStream istream = BatchLocalFile_1.getClient().getInputStreamAdapter().requestInputStream();
Then the parsing is done by passing both the InputStream and the handler to the SAXParser's parse method:
factory.newSAXParser().parse( istream, handler );
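
One detail worth knowing about the factory: a SAXParserFactory is not namespace-aware unless you ask for it, so the uri and localName values logged by MyHandler will typically be empty and only qName is reliable. The sketch below, with the made-up class NamespaceAwareSaxDemo and a made-up urn:example namespace, shows the effect of calling setNamespaceAware( true ) before creating the parser.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reports how the root element is described by the parser.
final class NamespaceAwareSaxDemo {

    // Returns "uri|localName|qName" as reported for the first element of the document.
    static String describeRoot(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (result[0] == null) {
                    result[0] = uri + "|" + localName + "|" + qName;
                }
            }
        };
        factory.newSAXParser().parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<x:item xmlns:x=\"urn:example\"/>";
        System.out.println("namespace-aware:     " + describeRoot(xml, true));
        System.out.println("not namespace-aware: " + describeRoot(xml, false));
    }
}
```

If your handler needs to match elements by namespace, add the setNamespaceAware call to parseXmlStream and compare uri and localName instead of qName.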

Conclusions


If you are struggling with 100 MB XML files and getting plenty of OutOfMemory errors with OTDs, you could try to implement a SAX parsing process as described in this article. Before implementing this technique, ask yourself why the hell you are producing such big XML files and then try to fix your data model or your process, because to me you are using XML the wrong way.
A typical case where dealing with large XML files could be unavoidable is HL7 v3 XML messaging: the specs define huge XML Schemas for that standard, and it could even be impossible to generate an OTD for them with the eDesigner.

1 comment:

  1. You may also want to look at vtd-xml, the latest and most advanced XML technology

    http://vtd-xml.sf.net
