CHAPTER 8 – XML with PHP 5 – PARSING XML
Two techniques are used for parsing XML documents in PHP: SAX (Simple API for XML) and DOM (Document Object Model). By using SAX, the parser goes through your document and fires events for every start and stop tag or other element found in your XML document. You decide how to deal with the generated events. By using DOM, the whole XML file is parsed into a tree that you can walk through using functions from PHP. PHP 5 provides another way of parsing XML: the SimpleXML extension. But first, we explore the two mainstream methods.
SAX

We now leave the somewhat boring theory behind and start with an example. Here, we're parsing the example XHTML file we saw earlier. We do that by using the XML functions available in PHP (http://php.net/xml). First, we create a parser object:

    $xml = xml_parser_create('UTF-8');

The optional parameter, 'UTF-8', denotes the encoding to use while parsing. When this function executes successfully, it returns an XML parser handle for use with all the other XML parsing functions.

Because SAX works by handling events, you need to set up the handlers. In this basic example, we focus on the two most important handlers: one for start and end tags, and one for character data (content):

    xml_set_element_handler($xml, 'start_handler', 'end_handler');
    xml_set_character_data_handler($xml, 'character_handler');

These statements register the handlers, but the handler functions themselves must be implemented before parsing starts. Let's look at how the handler functions should be implemented. The start handler registered above is passed three parameters: the XML parser object, the name of the tag, and an associative array containing the attributes defined for the tag.

    function start_handler ($xml, $tag, $attributes)
    {
        global $level;

        echo "\n". str_repeat('  ', $level). ">>>$tag";
        foreach ($attributes as $key => $value) {
            echo " $key $value";
        }
        $level++;
    }

The tag name is passed with all characters uppercased if case folding is enabled (the default). You can turn off this behavior by setting an option on the XML parser object, as follows:

    xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

The end handler is not passed the attributes array, only the XML parser object and the tag name:

    function end_handler ($xml, $tag)
    {
        global $level;

        $level--;
        echo str_repeat('  ', $level). "<<<$tag";
    }

To make our test script work, we need to implement the character handler to show all content.
We wrap the text in this handler so that it fits nicely on our terminal screen:

    function character_handler ($xml, $data)
    {
        global $level;

        $data = explode("\n", wordwrap($data, 76 - ($level * 2)));
        foreach ($data as $line) {
            echo str_repeat('  ', ($level + 1)). $line. "\n";
        }
    }

After we implement all the handlers, we can start parsing our XML file:

    xml_parse($xml, file_get_contents('test1.xhtml'));

The first part of the output of our script looks like this:

    >>>HTML XMLNS='http://www.w3.org/1999/xhtml' XML:LANG='en' LANG='en'
    ||
    ||
    | |
    >>>HEAD
    ||
    ||
    | |
    >>>TITLE
    |XML Example|
    <<<TITLE

It doesn't look very pretty. There's a lot of whitespace because the character data handler is called for every bit of data, including the whitespace between tags. We can improve the results by putting all data in a buffer, and only outputting the data when the tag closes or when another tag starts. The new script looks like this:

    <?php
    /* Initialize variables */
    $level = 0;
    $char_data = '';

    /* Create the parser handle */
    $xml = xml_parser_create('UTF-8');

    /* Set the handlers */
    xml_set_element_handler($xml, 'start_handler', 'end_handler');
    xml_set_character_data_handler($xml, 'character_handler');

    /* Start parsing the whole file in one run */
    xml_parse($xml, file_get_contents('test1.xhtml'));

    /****************************************************************
     * Functions
     */

    /*
     * Flushes collected data from the character handler
     */
    function flush_data ()
    {
        global $level, $char_data;

        /* Trim data and dump it when there is data */
        $char_data = trim($char_data);
        if (strlen($char_data) > 0) {
            echo "\n";
            // Wrap it nicely, so that it fits on a terminal screen
            $data = explode("\n", wordwrap($char_data, 76 - ($level * 2)));
            foreach ($data as $line) {
                echo str_repeat('  ', ($level + 1)). "[". $line. "]\n";
            }
        }

        /* Clear the data in the buffer */
        $char_data = '';
    }

    /*
     * Handler for start tags
     */
    function start_handler ($xml, $tag, $attributes)
    {
        global $level;

        /* Flush collected data from the character handler */
        flush_data();

        /* Dump attributes as a string */
        echo "\n". str_repeat('  ', $level). "$tag";
        foreach ($attributes as $key => $value) {
            echo " $key='$value'";
        }

        /* Increase indentation level */
        $level++;
    }

    /*
     * Handler for end tags
     */
    function end_handler ($xml, $tag)
    {
        global $level;

        /* Flush collected data from the character handler */
        flush_data();

        /* Decrease indentation level and print end tag */
        $level--;
        echo "\n". str_repeat('  ', $level). "/$tag";
    }

    /*
     * Handler for character data
     */
    function character_handler ($xml, $data)
    {
        global $level, $char_data;

        /* Add the character data to the buffer */
        $char_data .= ' '. $data;
    }
    ?>

The output is much easier to read:

    HTML XMLNS='http://www.w3.org/1999/xhtml' XML:LANG='en' LANG='en'
      HEAD
        TITLE
          [XML Example]
        /TITLE
      /HEAD
      BODY BACKGROUND='bg.png'
        P
          [Moved to]
          A HREF='http://example.org/'
            [example.org]
          /A
          [.]
          BR
          /BR
          [foo & bar]
        /P
      /BODY
    /HTML
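The scripts in this section hand the whole file to xml_parse() in one call and ignore its return value. As a rough sketch only, the following shows how the same parser handle could be fed in chunks, with the first parse error reported through xml_get_error_code(), xml_error_string(), and xml_get_current_line_number(). The helper parse_in_chunks(), the handlers collect_tag() and ignore_end_tag(), the chunk size, and the inline sample document are assumptions made for illustration; they are not part of the chapter's script.

```php
<?php
/*
 * Hypothetical helper (not from the chapter): feeds $data to the parser
 * handle $xml in pieces of $chunk_size bytes. Returns null on success,
 * or a message describing the first parse error.
 */
function parse_in_chunks($xml, $data, $chunk_size = 4096)
{
    $length = strlen($data);
    $offset = 0;

    do {
        $chunk    = (string) substr($data, $offset, $chunk_size);
        $offset  += $chunk_size;
        /* The third argument tells the parser that no more data follows */
        $is_final = ($offset >= $length);

        if (!xml_parse($xml, $chunk, $is_final)) {
            return sprintf(
                '%s at line %d',
                xml_error_string(xml_get_error_code($xml)),
                xml_get_current_line_number($xml)
            );
        }
    } while (!$is_final);

    return null;
}

/* Illustrative handlers: record every start tag, ignore end tags */
function collect_tag($xml, $tag, $attributes)
{
    $GLOBALS['tags'][] = $tag;
}

function ignore_end_tag($xml, $tag)
{
}

/* Usage with a tiny inline document and a deliberately small chunk size */
$GLOBALS['tags'] = array();
$xml = xml_parser_create('UTF-8');
xml_set_element_handler($xml, 'collect_tag', 'ignore_end_tag');
$error = parse_in_chunks($xml, '<a><b>text</b></a>', 5);
echo ($error === null) ? implode(' ', $GLOBALS['tags']) : $error, "\n";
```

Because the handlers fire as each chunk arrives, only one chunk has to be held in memory at a time. With a real file, the chunks would more likely come from fread() in a loop than from substr() on a string that is already loaded.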
DOM

Parsing a simple X(HT)ML file with a SAX parser is a lot of work. Using the DOM (http://www.w3.org/TR/DOM-Level-3-Core/) method is much easier, but you pay a price: memory usage. Although it might not be noticeable in our small example, it's definitely noticeable when you parse a 20MB XML file with the DOM method. Rather than firing events for every element in the XML file, DOM creates a tree in memory containing your XML file. Figure 8.1 shows the DOM tree that represents the file from the previous section.

Fig. 8.1 DOM tree.

We can show all the content without tags by walking through the tree of objects. We do so in this example by recursively going over all node children:

     1  <?php
     2  $dom = new DomDocument();
     3  $dom->load('test2.xml');
     4  $root = $dom->documentElement;
     5
     6  process_children($root);
     7
     8  function process_children($node)
     9  {
    10      $children = $node->childNodes;
    11
    12      foreach ($children as $elem) {
    13          if ($elem->nodeType == XML_TEXT_NODE) {
    14              if (strlen(trim($elem->nodeValue))) {
    15                  echo trim($elem->nodeValue)."\n";
    16              }
    17          } else if ($elem->nodeType == XML_ELEMENT_NODE) {
    18              process_children($elem);
    19          }
    20      }
    21  }
    22  ?>

The output is the following:

    XML Example
    Moved to
    example.org
    .
    foo & bar

The example shows some very simple DOM processing. We only read properties of the node objects and do not call any of their methods. In line 4, we retrieve the root element of the DOM document that was loaded in line 3. For every element we encounter, we call process_children() (in lines 6 and 18), which iterates over the list of child nodes (line 12). If the node is a text node, we echo its value (lines 13-16), and if it's an element, we call process_children() recursively (lines 17-18). The DOM extension is more powerful than what is shown in this example.
It implements almost all the functionality described in the DOM2 specification. The following example uses the getAttributeNode() method of the DomElement class to return the background attribute of the body tag:

     1  <?php
     2  $dom = new DomDocument();
     3  $dom->load('test2.xml');
     4  $root = $dom->documentElement;
     5
     6  process_children($root);
     7
     8  function process_children($node)
     9  {
    10      $children = $node->childNodes;
    11
    12      foreach ($children as $elem) {
    13          if ($elem->nodeType == XML_ELEMENT_NODE) {
    14              if ($elem->nodeName == 'body') {
    15                  echo $elem->getAttributeNode('background')
                            ->value. "\n";
    16              }
    17              process_children($elem);
    18          }
    19      }
    20  }
    21  ?>

We still need to recursively search through the tree to find the correct element, but because we know about the structure of the document, we can simplify the example:

     1  <?php
     2  $dom = new DomDocument();
     3  $dom->load('test2.xml');
     4  $body = $dom->documentElement->getElementsByTagName('body')
            ->item(0);
     5  echo $body->getAttributeNode('background')->value. "\n";
     6  ?>

Line 4 is the main processing line. First, we request the documentElement of the DOM document, which is the root node of the DOM tree. From that element, we request all descendant elements with tag name body by using getElementsByTagName(). Then, we take the first item in the list (because we know that the first body tag in the file is the correct one). In line 5, we request the background attribute with getAttributeNode(), and display its value by reading the value property.
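When only the attribute's string value is needed, the DomElement class also offers getAttribute(), which returns the value directly, and hasAttribute(), which tells a missing attribute apart from an empty one. The sketch below is a variation on the last example. The function name body_background() and the inline document, which merely stands in for test2.xml, are assumptions made for illustration.

```php
<?php
/*
 * Hypothetical helper (not from the chapter): returns the value of the
 * background attribute on the first body element, or null when there is
 * no body element or the attribute is missing.
 */
function body_background($source)
{
    $dom = new DOMDocument();
    $dom->loadXML($source);

    /* item(0) yields null when the list is empty */
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null || !$body->hasAttribute('background')) {
        return null;
    }

    /* getAttribute() returns the attribute value as a string */
    return $body->getAttribute('background');
}

/* Assumed stand-in for test2.xml; the real file is not reproduced here */
$source = '<html><head><title>XML Example</title></head>'
        . '<body background="bg.png"><p>Moved to example.org.</p></body>'
        . '</html>';

echo body_background($source), "\n";
```

Checking hasAttribute() first matters because getAttribute() returns an empty string for an attribute that is not present, which would otherwise be indistinguishable from background="".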