MSV has a native API for validation which enables better error reporting and flexible validation. This document describes this native API of MSV.
The native API consists of two interfaces: Acceptor and DocumentDeclaration.
DocumentDeclaration is the VGM. Its sole purpose is to create an Acceptor that validates the top-level sequence, which is usually the root element.
An Acceptor performs validation for one content model (siblings). It can create new "child" acceptors to validate child content models, thereby validating the whole tree.
One simple way to compile a schema into a VGM is to use the GrammarLoader.loadVGM method. This method takes a schema as an argument, compiles it into an AGM, and then wraps it in a VGM. The source code of GrammarLoader shows how you can create a VGM in other ways.
Note that different schema languages may use different VGM implementations, and there may be more than one VGM implementation for a single schema language. For example, W3C XML Schema currently uses com.sun.msv.verifier.regexp.xmlschema.XSREDocDecl while all others use com.sun.msv.verifier.regexp.REDocumentDecl. So creating a VGM from an AGM is nontrivial.
Let's assume that we have a DocumentDeclaration object and see how we can perform a plain-vanilla validation by traversing a DOM tree.
At a high level, validation is done by passing information about the XML document through the methods of the Acceptor interface, creating an acceptor for each element.
The first thing to do is to create an Acceptor and use it to validate the top level, as follows:
void validate( Document dom, DocumentDeclaration docDecl ) {
    Acceptor acc = docDecl.createAcceptor();
    // validateElement returns void, so there is nothing to return here.
    validateElement(dom.getDocumentElement(),acc);
}
The validateElement method is defined here as validating a given element with a given acceptor:
void validateElement( Element node, Acceptor acc ) { ... }
Validation of an element is done by the createChildAcceptor method.
This method creates a child acceptor, which will validate the children of that element. It takes a StartTagInfo as a parameter; this object holds the element name and attributes (the information in the start tag), and you are responsible for creating it.
void validateElement( Element node, Acceptor acc ) {
    // StartTagInfo uses an Attributes object for keeping attributes.
    org.xml.sax.helpers.AttributesImpl atts =
        /* create SAX Attributes object from attributes of this node. */;

    StartTagInfo sti = new StartTagInfo(
        node.getNamespaceURI(),   // information about the element name.
        node.getLocalName(),
        node.getNodeName(),       // qualified name of the element
        atts,
        context );

    Acceptor child = acc.createChildAcceptor(sti,null);
    if(child==null)     throw new InvalidException();
}
If there is a validation error (e.g., an unexpected element), the createChildAcceptor method returns null.
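The placeholder for building the SAX Attributes object can be filled in with standard DOM and SAX classes only. The following is a minimal sketch, not MSV code: the class name DomAttributes and the method toSax are hypothetical, and it assumes the DOM tree was built by a namespace-aware parser and that namespace declarations (xmlns attributes) should be skipped, as a SAX parser does by default.

```java
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.helpers.AttributesImpl;

final class DomAttributes {
    // Namespace that DOM assigns to xmlns and xmlns:* declarations.
    private static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    /** Copies the attributes of a namespace-aware DOM element into a SAX AttributesImpl. */
    static AttributesImpl toSax(Element node) {
        AttributesImpl result = new AttributesImpl();
        NamedNodeMap map = node.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Attr a = (Attr) map.item(i);
            // Assumption: namespace declarations are not reported as attributes.
            if (XMLNS_NS.equals(a.getNamespaceURI())) continue;
            // SAX uses "" rather than null for "no namespace".
            String uri = a.getNamespaceURI() == null ? "" : a.getNamespaceURI();
            // getLocalName() is null for nodes created without namespace support.
            String local = a.getLocalName() == null ? a.getName() : a.getLocalName();
            result.addAttribute(uri, local, a.getName(), "CDATA", a.getValue());
        }
        return result;
    }
}
```

With such a helper, the placeholder would become something like DomAttributes.toSax(node).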
Once you have created a child acceptor, the next thing to do is to validate the children (attributes of that element, child elements, and text within that element) with it. After that, call the isAcceptState method to see if the child acceptor is "satisfied". An acceptor is satisfied when the whole content model has been matched.
Acceptor child = acc.createChildAcceptor(sti,null);
if(child==null)     throw new InvalidException();

validateChildren(node,child);

// test if it's OK to end the contents here.
// (null: no StringRef, so no error message is requested)
if(!child.isAcceptState(null))
    throw new InvalidException();
For example, when the content model is (a,b,c) and the actual content is <a/><b/>, the acceptor won't be satisfied because it still needs to see c.
So when this method returns false, it means that mandatory elements are missing.
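To make the (a,b,c) example concrete, here is a toy model of the "satisfied" check. It is only an illustration under simplifying assumptions: ToySequenceAcceptor is a hypothetical class that handles a fixed sequence of element names, and it is not how MSV implements Acceptor.

```java
import java.util.List;

/** Toy stand-in for an acceptor over a fixed sequence content model; not MSV's implementation. */
final class ToySequenceAcceptor {
    private final List<String> expected; // e.g. ["a", "b", "c"] for (a,b,c)
    private int position = 0;            // number of items consumed so far

    ToySequenceAcceptor(List<String> expected) {
        this.expected = expected;
    }

    /** Consumes one child element name; returns false if it is not the next expected one. */
    boolean step(String elementName) {
        if (position < expected.size() && expected.get(position).equals(elementName)) {
            position++;
            return true;
        }
        return false;
    }

    /** True only when every mandatory item has been seen. */
    boolean isAcceptState() {
        return position == expected.size();
    }
}
```

After stepping through "a" and "b" only, isAcceptState() on this toy returns false; it becomes true once "c" has been consumed as well.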
Once you have made sure that the child acceptor is in a valid state, pass it back to the parent acceptor. The parent acceptor will step forward (think of it as an automaton) by consuming the child acceptor.
acc.stepForward(child);
The complete code of the validateElement method will be as follows:
void validateElement( Element node, Acceptor acc ) {
    // create StartTagInfo
    StartTagInfo sti = new StartTagInfo( ... );

    Acceptor child = acc.createChildAcceptor(sti,null);
    if(child==null)     throw new InvalidException();

    validateChildren(node,child,sti);

    // test if it's OK to end the contents here.
    if(!child.isAcceptState(null))
        throw new InvalidException();

    acc.stepForward(child);
}
Let's move on to the validateChildren method.
First, call the onAttribute method for each attribute:
void validateChildren( Element node, Acceptor acc, StartTagInfo sti ) {
    NamedNodeMap atts = node.getAttributes();

    for( int i=0; i<atts.getLength(); i++ ) {
        // NamedNodeMap.item returns a Node, so a cast is needed.
        Attr a = (Attr)atts.item(i);
        if( !acc.onAttribute(a.getNamespaceURI(),a.getLocalName(), ... ) )
            throw new InvalidException();
    }
}
The onAttribute method returns false if there is an error in the attribute (e.g., an undefined attribute, or a wrong attribute value).
Then, call the onEndAttributes method to indicate that there are no more attributes.
// pass the StartTagInfo of this element, not the acceptor itself.
if(!acc.onEndAttributes(sti,null))
    throw new InvalidException();
This method returns false when more attributes are required; for example, when a mandatory attribute is missing.
Once you have processed the attributes, process the children (contents) of the element.
node.normalize();
for( Node n = node.getFirstChild(); n!=null; n=n.getNextSibling() ) {
    switch(n.getNodeType()) {
    case Node.ELEMENT_NODE:
        validateElement( (Element)n, acc );
        break;
    case Node.TEXT_NODE:
    case Node.CDATA_SECTION_NODE:
        String text = n.getNodeValue();
        if(!acc.onText(text,context,null,null))
            throw new InvalidException();
        break;
    }
}
It is important to normalize the DOM tree, because the onText method has to be called with the whole text chunk. For example, if you have XML like <foo>abcdef</foo>, you cannot split "abcdef" into two substrings and call the onText method twice.
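The effect of normalization can be checked with the standard DOM API alone. The sketch below (NormalizeDemo and splitTextElement are hypothetical names) builds a foo element whose text is deliberately split into two adjacent text nodes, and shows that normalize() merges them into the single chunk that onText expects.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NormalizeDemo {
    /** Builds <foo> with "abc" and "def" as two adjacent text nodes. */
    static Element splitTextElement() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element foo = doc.createElement("foo");
        foo.appendChild(doc.createTextNode("abc"));
        foo.appendChild(doc.createTextNode("def"));
        return foo;
    }

    public static void main(String[] args) throws Exception {
        Element foo = splitTextElement();
        System.out.println(foo.getChildNodes().getLength());    // prints 2
        foo.normalize();                                        // merge adjacent text nodes
        System.out.println(foo.getChildNodes().getLength());    // prints 1
        System.out.println(foo.getFirstChild().getNodeValue()); // prints abcdef
    }
}
```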
The onText method returns false if the text is invalid. Usually, this is because text is not allowed there at all, or because the text is invalid with respect to the datatype.
The following table summarizes atoms in XML documents and actions you have to take.
Atom | Action |
start tag | call the createChildAcceptor and switch to the child acceptor |
end tag | call the isAcceptState then stepForward, switch back to the parent acceptor. |
attribute | call the onAttribute method. Don't forget to call the onEndAttributes. |
text | call the onText method. Be careful with the normalization. |
Although it was not mentioned in the previous section, one needs to pass a "context" object (com.sun.msv.verifier.IDContextProvider) to some of the above-mentioned methods. These objects provide contextual information (namespace prefix bindings, the base URI, etc.). For example, the "QName" datatype needs to resolve a namespace prefix into a namespace URI.
You have to implement a context object yourself and pass it to the methods that need it. If you are not interested in xml:base, you can return null from the getBaseUri method. Similarly, if you don't care about entities and notations, you can return false from the isNotation and isUnparsedEntity methods.
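As a rough illustration of such a context object, the sketch below backs it with a DOM element. To stay self-contained it declares a hypothetical ToyContext interface rather than implementing IDContextProvider itself; getBaseUri, isNotation and isUnparsedEntity are the methods named above, while resolveNamespacePrefix is an assumed name for the prefix lookup. Check the real interface for its exact method set before adapting this.

```java
import org.w3c.dom.Element;

/** Hypothetical stand-in for the context interface; the real one is IDContextProvider. */
interface ToyContext {
    String resolveNamespacePrefix(String prefix); // assumed method name
    String getBaseUri();
    boolean isNotation(String name);
    boolean isUnparsedEntity(String name);
}

/** Context backed by an element of a namespace-aware DOM tree. */
final class DomContext implements ToyContext {
    private final Element element;

    DomContext(Element element) {
        this.element = element;
    }

    @Override
    public String resolveNamespacePrefix(String prefix) {
        // DOM Level 3 lookup walks up the ancestors; null asks for the default namespace.
        return element.lookupNamespaceURI(prefix.isEmpty() ? null : prefix);
    }

    // Not interested in xml:base, entities or notations in this sketch.
    @Override public String getBaseUri() { return null; }
    @Override public boolean isNotation(String name) { return false; }
    @Override public boolean isUnparsedEntity(String name) { return false; }
}
```

A new DomContext would typically be created for (or re-pointed at) each element being validated, since prefix bindings depend on the current position in the tree.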
Most of the methods on the Acceptor interface return false to indicate a validation error. To obtain a more detailed error message, pass a StringRef object to those methods.
Consider the following example for the isAcceptState method:
if(!acc.isAcceptState(null)) {
    // there was an error in the document.

    // create a StringRef object. This object will
    // receive error message.
    StringRef ref = new StringRef();

    // call the isAcceptState method again
    acc.isAcceptState(ref);

    // print the error message
    System.out.println(ref.str);
}
These methods do not change the state of the acceptor when they return false, so you can call the same method again (with a valid StringRef object) to get the error message.
If you specify a StringRef object, the acceptor will recover from the error as a side effect. For example, if the createChildAcceptor method returns null and you call the same method again with a StringRef, it will return a valid child acceptor object.
Acceptor child = parent.createChildAcceptor(sti,null);
if(child==null) {
    // get the error message
    StringRef ref = new StringRef();
    child = parent.createChildAcceptor(sti,ref);
    System.out.println(ref.str);

    // the above statement will return a valid acceptor
    // so we can continue validating documents.
}
...
The same recovery behavior applies to all other methods. This makes it possible to continue validation after seeing errors.
Note that because the error recovery is highly ad hoc, it sometimes falls into panic mode, in which a lot of false errors are reported. So you may want to implement some kind of filter that suppresses error messages until you are sure that validation is back in sync.
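One possible shape for such a filter is sketched below. It is only a heuristic of our own choosing, not something MSV provides: the hypothetical PanicModeFilter reports an error only if at least resyncThreshold consecutive events succeeded since the previous error, on the assumption that a run of successes means validation is back in sync.

```java
import java.util.ArrayList;
import java.util.List;

final class PanicModeFilter {
    private final int resyncThreshold; // successes needed before errors are reported again
    private int successesSinceError;   // consecutive successful events since the last error
    private final List<String> reported = new ArrayList<>();

    PanicModeFilter(int resyncThreshold) {
        this.resyncThreshold = resyncThreshold;
        this.successesSinceError = resyncThreshold; // start "in sync"
    }

    /** Call after every validation event that succeeded. */
    void onSuccess() {
        if (successesSinceError < resyncThreshold) successesSinceError++;
    }

    /** Call with the message of every failed event; returns true if it was reported. */
    boolean onError(String message) {
        boolean inSync = successesSinceError >= resyncThreshold;
        successesSinceError = 0; // any error restarts the count
        if (inSync) reported.add(message);
        return inSync;
    }

    /** Messages that passed the filter, in order. */
    List<String> reported() {
        return reported;
    }
}
```

The caller would invoke onSuccess() whenever an Acceptor method returns true (or a non-null child), and onError(ref.str) after retrieving a message through a StringRef. The right threshold depends on the documents being validated.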
Acceptors can always be cloned by calling the createClone method. Such a clone is useful to "bookmark" a particular element of a document.
For example, you can run the normal validation once and associate each DOM Node with a clone of the Acceptor at that point. Later, you can use that cloned acceptor to re-validate a subtree.
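The bookmarking idea can be sketched with a toy acceptor. Everything here is hypothetical and simplified: ToyAcceptor only handles a fixed sequence of element names, and the bookmarks are kept in a list indexed by child position, where real code would more likely use a map keyed by DOM Node and hold clones of real Acceptor objects.

```java
import java.util.ArrayList;
import java.util.List;

final class BookmarkDemo {
    /** Toy acceptor for a fixed sequence of element names; a stand-in, not MSV's Acceptor. */
    static final class ToyAcceptor {
        private final List<String> expected;
        private int position;

        ToyAcceptor(List<String> expected, int position) {
            this.expected = expected;
            this.position = position;
        }

        /** Consumes one child name; returns false if it is not the next expected one. */
        boolean step(String name) {
            if (position < expected.size() && expected.get(position).equals(name)) {
                position++;
                return true;
            }
            return false;
        }

        /** Independent copy of the current state, usable as a bookmark. */
        ToyAcceptor createClone() {
            return new ToyAcceptor(expected, position);
        }
    }

    /** Validates the children in order, saving a clone of the state just before each one. */
    static List<ToyAcceptor> bookmarkAll(ToyAcceptor acc, List<String> children) {
        List<ToyAcceptor> bookmarks = new ArrayList<>();
        for (String child : children) {
            bookmarks.add(acc.createClone());
            if (!acc.step(child)) throw new IllegalStateException("unexpected element: " + child);
        }
        return bookmarks;
    }
}
```

To re-validate from a bookmark more than once, clone the bookmark first and step the clone, so that the stored state stays untouched.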
In the onText and onAttribute methods, applications can obtain the datatypes that are assigned to the text or attribute value.
To obtain this information, pass a non-null DatatypeRef object to those methods. Upon completion of the method, this DatatypeRef object will receive an array of Datatype objects.
When the array is null or empty, it means there was an error or the datatype was not uniquely determined. When there is only one item in the array, it means the attribute value (or the text) was validated as that datatype. If there is more than one item in the array, it means the attribute value (or the text) was validated as a <list> (of RELAX NG), and each datatype in the array indicates the datatype of the corresponding token.
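The three cases can be captured in a small helper. This is a sketch that only encodes the rules stated above; DatatypeResults, Outcome and classify are hypothetical names, and the element type is left generic so that the example does not depend on the actual Datatype class.

```java
final class DatatypeResults {
    enum Outcome { UNDETERMINED, SINGLE, LIST }

    /** Classifies the array received through a DatatypeRef-style holder. */
    static <T> Outcome classify(T[] types) {
        // null or empty: an error, or the datatype was not uniquely determined
        if (types == null || types.length == 0) return Outcome.UNDETERMINED;
        // one item: validated as that datatype; several: one datatype per list token
        return types.length == 1 ? Outcome.SINGLE : Outcome.LIST;
    }
}
```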