Section 1: Basic elements to understand XML and Axl
Section 2: Manipulating and producing XML documents
Section 3: Doing validation on your documents
Section 4: Advanced topics
Appendix
XML 1.0 is a fairly human-readable format, so it is not the best choice if you are looking for space efficiency. What XML 1.0 provides, on the other hand, is the ability to quickly prototype and produce working formats that encapsulate your data and that can evolve together with your system.
Among other things, XML 1.0 provides ways to validate your documents, ensuring that your code reads XML documents in the expected format and reducing the development time and cost of writing additional checks yourself.
Before continuing, we will explain some concepts that are required to understand XML 1.0 and why the Axl API was built this way.
    <?xml version="1.0" ?>
    <!-- This is a comment -->
    <complex>
      <data>
        <simple>10</simple>
        <empty attr1="value1" />
      </data>
    </complex>
The previous XML document represents a structure with a top-level node, called complex, that has a single child called data, which in turn has two children. The first child, called simple, has content; the other, called empty, is what is usually called an empty XML node.
The representation of the previous document is the following:
Document representation
Alternatively, while using the MIXED API, you can traverse child items stored for a particular node, detecting those items that are ITEM_CONTENT or ITEM_CDATA (using axl_item_get_type).
This is particularly important if you take into consideration that a node can have content (ITEM_CONTENT), comments (ITEM_COMMENT), processing instructions (ITEM_PI) and CDATA content (uninterpreted content, ITEM_CDATA), all of them mixed with more XML nodes (ITEM_NODE).
So, to summarize, we have a root node, that could contain more nodes, that could contain PCDATA, or content, and those nodes could contain named attributes with values.
The reason for this API is simple. The XML 1.0 definition allows content to be mixed with more nodes, comments and many other elements as children of a particular node.
This definition, found in the standard, has led many XML implementations to provide only an API that supports all these features: an interface that is complicated and overloaded, and that gives you power you do not need while making your development less efficient.
However, developers often require only a usual form of XML, called CHILDREN, in which a node has either child nodes or content, but not both at the same time. This kind of XML is really useful: it is easy to parse, easy to describe with a DTD, more compact and extensible.
Let's see an example of both formats to clarify:
    <?xml version='1.0' ?>
    <document>
      <!-- Children XML format example: as you can see -->
      <!-- nodes only contains either nodes or node content -->
      <!-- but nothing mixed at the same level -->
      <node1>
        This is node1 content
      </node1>
      <node2>
        <node3>
          This is node3 content
        </node3>
        <node4 />
      </node2>
    </document>
A MIXED XML document, in contrast, could be:
    <?xml version='1.0' ?>
    <document>
      <!-- Children XML format example: as you can see -->
      <!-- nodes only contains either nodes or node content -->
      <!-- but nothing mixed at the same level -->
      <node1>
        This is node1 content
      </node1>
      Content mixed with xml nodes at the same level.
      <node2>
        More content....
        <node3>
          This is node3 content
        </node3>
        <node4 />
      </node2>
    </document>
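The difference between the two formats can be sketched with a small, self-contained C function. This is only an illustration of the rule and not part of the Axl API: the item_kind enum and the is_children_format function are hypothetical names that loosely mirror the ITEM_NODE, ITEM_CONTENT and ITEM_COMMENT types mentioned in this manual.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical item kinds, loosely mirroring the ITEM_NODE,
 * ITEM_CONTENT and ITEM_COMMENT types mentioned in this manual.
 * This is not Axl's own enum. */
typedef enum { KIND_NODE, KIND_CONTENT, KIND_COMMENT } item_kind;

/* Returns true when the items found directly under one node follow
 * the CHILDREN format: child nodes or content, but never both.
 * Comments are ignored by this check. */
bool is_children_format (const item_kind * items, size_t count)
{
        bool   has_node    = false;
        bool   has_content = false;
        size_t i;

        for (i = 0; i < count; i++) {
                if (items[i] == KIND_NODE)
                        has_node = true;
                else if (items[i] == KIND_CONTENT)
                        has_content = true;
        }

        /* MIXED means nodes and content at the same level */
        return ! (has_node && has_content);
}
```

Applied to the items under <document>, the first example (three comments and two nodes) passes the check, while the second one (three comments, a node, some content and another node) does not.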
Both approaches, which are valid using the XML 1.0 standard, are appropriate for particular situations:
Having introduced the context of the problem, Axl Library takes no position: it provides one API that fits XML content following a CHILDREN description and another API for the MIXED description.
In this context, the API you choose only affects the way you traverse the document. The CHILDREN API is mainly provided by the Axl Node interface and the MIXED API is mainly provided by the Axl Item interface.
You don't need to perform any special operation to activate either API; both are provided at the same time. Let's see an example:
Supposing the previous mixed example, the following code will get access to the <node2> reference:
    // supposing "doc" reference contains the document loaded
    axlNode * node;

    // get the document root, that is <document>
    node = axl_doc_get_root (doc);

    // get the first child for the document root (<node1>)
    node = axl_node_get_first_child (node);

    // get the next child (brother of <node1>, that is <node2>)
    node = axl_node_get_next (node);
However, with the MIXED API you can get every detail, every item found for a particular node. This is how:
    // supposing "doc" reference contains the document loaded
    axlNode * node;
    axlItem * item;

    // get the document root, that is <document>
    node = axl_doc_get_root (doc);

    // get the first item child for the document root that is the comment:
    // "Children XML format example: as you can see".
    item = axl_item_get_first_child (node);

    // now skip the following two comments
    item = axl_item_get_next (item);
    item = axl_item_get_next (item);

    // now the next item is holding the <node1>
    item = axl_item_get_next (item);
    node = axl_item_get_data (item);

    // now get the content between the <node1> and <node2>
    item = axl_item_get_next (item);

    // and finally, get the next child (brother of <node1>, that is
    // <node2>)
    item = axl_item_get_next (item);
    node = axl_item_get_data (item);
Obviously, the mixed example contains more code and is more fragile when the XML document changes. The problem is that the MIXED API is more general than the CHILDREN API, which leads many XML libraries to provide only that API.
As a consequence:
Let's start with a very simple example:
    #include <axl.h>
    #include <stdio.h>

    int main (int argc, char ** argv)
    {
        // top level definitions
        axlError * error = NULL;
        axlDoc   * doc   = NULL;

        // initialize axl library
        if (! axl_init ()) {
            printf ("Unable to initialize Axl library\n");
            return -1;
        }

        // get current doc reference
        doc = axl_doc_parse_from_file ("large.xml", &error);
        if (doc == NULL) {
            // report the failure and release the error and the library
            printf ("Unable to parse document: %s\n", axl_error_get (error));
            axl_error_free (error);
            axl_end ();
            return -1;
        }

        // DO SOME WORK WITH THE DOCUMENT HERE

        // release the document
        axl_doc_free (doc);

        // cleanup axl library
        axl_end ();

        return 0;
    }
First you must use axl_doc_get_root to get the document root (axlNode), which contains all the information. Then, according to the interface you are using, you must call either axl_node_get_first_child or axl_item_get_first_child.
Once you have access to the first element, you can use the following set of functions to get more references to other nodes or items:
There are alternative APIs that will allow you to iterate the document, providing a callback: axl_doc_iterate.
Another approach is to use axl_doc_get and axl_doc_get_content_at to get fast access to a particular node using a really limited XPath syntax.
Check the following functions while operating with axlNode elements:
Check the following functions while operating with axlItem elements:
In the case you want to produce xml content taking as reference a particular node use:
You may also need to check that an XML document you have received follows a defined XML structure, but doing this check by hand in your own code is too complex.
For this purpose, XML 1.0 defines the DTD (Document Type Definition), which allows you to specify the document grammar: how nodes are nested, which attributes they can contain, and whether they are allowed to be empty nodes or must have child nodes.
Let's start with the DTD syntax used to configure restrictions on node structure:
    <!-- sequence specification -->
    <!ELEMENT testA (test1, test2, test3)>

    <!-- choice specification -->
    <!ELEMENT testB (test1 | test2 | test3)>
The DTD <!ELEMENT directive is modeled on top of two concepts, which are later expanded with repetition patterns. We will explain those later. For now, these two top-level concepts are: sequence and choice.
A sequence specification (elements separated by a comma), the one used to restrict the node testA, denotes that testA has as children test1, followed by test2 and ended by test3. The specified order must be followed and all instances must appear. This can be tweaked using repetition patterns.
On the other hand, a choice specification (elements separated by a pipe, |) specifies that the content of a node is built using nodes from the choice list. So, in this case, the testB node can have one instance of either test1, test2 or test3.
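The two concepts can be sketched in plain C as checks over the names of the children found under a node. This is a minimal illustration under stated assumptions, not the way Axl implements validation: matches_sequence and matches_choice are hypothetical helpers, and repetition patterns are not considered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sequence: the children must be exactly the listed names, all of
 * them present and in the listed order. */
bool matches_sequence (const char ** spec,     size_t spec_len,
                       const char ** children, size_t child_len)
{
        size_t i;

        if (spec_len != child_len)
                return false;
        for (i = 0; i < spec_len; i++) {
                if (strcmp (spec[i], children[i]) != 0)
                        return false;
        }
        return true;
}

/* Choice: exactly one child, whose name is any entry of the list. */
bool matches_choice (const char ** spec,     size_t spec_len,
                     const char ** children, size_t child_len)
{
        size_t i;

        if (child_len != 1)
                return false;
        for (i = 0; i < spec_len; i++) {
                if (strcmp (spec[i], children[0]) == 0)
                        return true;
        }
        return false;
}
```

With the list test1, test2, test3, a testA node holding those three children in that order satisfies the sequence check, while a testB node holding a single test2 child satisfies the choice check.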
Now that you know these two basic elements for modelling how the children of a node are organized, what is needed is to keep adding <!ELEMENT directives until all nodes are specified. You will end your DTD document with final nodes that are either empty or have PCDATA. At this moment MIXED nodes are not supported.
Suppose that all nodes that are inside testA and testB are final ones. Then this could be its DTD specification:
    <!-- test1 is a node that only has content -->
    <!ELEMENT test1 (#PCDATA)>

    <!-- test2 is a node that is always empty -->
    <!ELEMENT test2 EMPTY>

    <!-- test3 is a node that could have either test1 or test2 -->
    <!ELEMENT test3 (test1 | test2)>
Sequences and choices can be composed to create richer DTD expressions that combine sequences of choices and so on.
At this point all the elements required to model choices, sequences and final nodes have been explained, but we still have to talk about repetition patterns. They are symbols appended to the elements inside choices (or sequences), or to the list specifications themselves.
Patterns available are: +, ? and *. By default, if no pattern is applied to the element, it means that the match should be produced one and only one time.
The + pattern is used to model elements that must be matched at least once (one or more times).
The * pattern is used to model elements that can be matched zero or more times.
The ? pattern is used to model elements that should be matched zero or one times.
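The meaning of each pattern can be summarized as a constraint on how many times an element is matched. The following C sketch is only an illustration of those constraints: repetition_allows is a hypothetical helper, not an Axl function, and the value 0 is used here to mean that no pattern was applied.

```c
#include <stdbool.h>

/* Returns true when "count" matches of an element satisfy the given
 * repetition pattern: '+', '*', '?', or 0 when no pattern is applied. */
bool repetition_allows (char pattern, unsigned int count)
{
        switch (pattern) {
        case '+':
                return count >= 1;   /* one or more */
        case '*':
                return true;         /* zero or more */
        case '?':
                return count <= 1;   /* zero or one */
        case 0:
                return count == 1;   /* one and only one time */
        default:
                return false;        /* not a repetition pattern */
        }
}
```

For instance, an element that never appears is accepted by * and ?, but rejected by + and by an element with no pattern.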
For the example explained initially, let's suppose we want the content inside testA to be the sequence test1, test2 and test3, repeated at least one time. We only need to add a + repetition pattern as follows:
    <!-- sequence specification -->
    <!ELEMENT testA (test1, test2, test3)+>
So, we are telling our validation engine that the sequence inside testA can be found one or many times, but the entire sequence must be matched every time.
Here is a simple example that loads an XML document, then loads a DTD file, and then validates the XML document:
    bool test_12 (axlError ** error)
    {
        axlDoc * doc = NULL;
        axlDtd * dtd = NULL;

        // parse gmovil file (an af-arch xml chunk)
        doc = axl_doc_parse_from_file ("channel.xml", error);
        if (doc == NULL)
            return false;

        // parse af-arch DTD
        dtd = axl_dtd_parse_from_file ("channel.dtd", error);
        if (dtd == NULL) {
            // release the document already loaded
            axl_doc_free (doc);
            return false;
        }

        // perform DTD validation
        if (! axl_dtd_validate (doc, dtd, error)) {
            // release both references before reporting the failure
            axl_doc_free (doc);
            axl_dtd_free (dtd);
            return false;
        }

        // free doc reference
        axl_doc_free (doc);

        // free dtd reference
        axl_dtd_free (dtd);

        return true;
    }
Think about using <table> as a tag for your document. Many XML applications use <table> as a valid tag in their XML language set. However, each of them gives it a different meaning, and it must be handled by the proper XML software.
While developing applications with XML, and supposing such XML documents will be used by more applications than yours, you are likely to be interested in using XML Namespaces. In other words, many of the new XML standards that are appearing use XML Namespaces to define their own XML node names, while allowing users and developers to define their own set of XML tags, under their own XML Namespaces, so that both can be used in the same document.
XML Namespaces support inside Axl Library is handled through a separate library, which requires the base library to function. Here are some instructions to get Axl Library Namespace installed.
This library provides functions that replace some of the functions used by XML applications that don't require XML Namespaces. In particular, some of them are:
See also API documentation for all functions that are provided to enable your application with XML Namespaces:
./configure --axl-log-disable
According to our results, the library including the log to console information is about 265K. Without log to console information the library takes about 200K.
make install-strip
According to our results, the library without log to console and debugging information takes about 89K.
The previous information applies to the Axl base library (libaxl.so/.dll); however, the same happens for the rest of the software components bundled with Axl.
Please, if you find that something isn't properly documented or you think that something could be improved, contact us on the mailing list. We are building Axl Library with the aim of producing a high quality, commercial grade, open source XML development kit, so any help received will be welcome.
Remember you can always contact us at the mailing list for any question not properly answered by this documentation. See Axl Library website documentation to get more information about mailing list.