Usage Instructions

Transparent Parser Construction

NekoHTML is designed to be as lightweight and simple to use as possible. Using the Xerces 2.0.0 parser as a foundation, NekoHTML can be transparent for applications that instantiate parser objects with the Java API for XML Processing (JAXP). Just put the appropriate NekoHTML jar files in the classpath before the Xerces jar files. For example (on Windows): [Note: The classpath should be contiguous. It is split among separate lines in this example to make it easier to read.]

> java -cp nekohtml.jar;nekohtmlXni.jar;
           xmlParserAPIs.jar;xercesImpl.jar;xercesSamples.jar 
       sax.Counter doc/index.html
doc/index.html: 10 ms (49 elems, 21 attrs, 0 spaces, 2652 chars)

The Xerces2 implementation dynamically instantiates the default parser configuration to construct parser objects via the Jar service facility. The Jar file nekohtmlXni.jar contains a META-INF/services file that is read by the Xerces2 implementation for this purpose. Therefore, as long as this Jar file appears before the Xerces2 Jar files, the NekoHTML parser configuration will be used instead of the Xerces2 standard configuration.
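
From the application's side, the transparent approach involves no NekoHTML-specific code. The following is a minimal sketch of such a plain JAXP client; the class name JaxpElementCount and the sample markup are made up for illustration. Run with only the JDK on the classpath, it uses the JDK's built-in parser, which is why the sample input is well-formed. Whether the NekoHTML configuration is picked up instead depends on the classpath arrangement described above and on which JAXP implementation is selected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class name, used for illustration only.
final class JaxpElementCount {

    // Parses the markup through JAXP and returns the number of elements.
    // The parser implementation is chosen by the JAXP lookup and the
    // classpath, not by anything in this code.
    static int countElements(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        // "*" matches every element in the document, including the root.
        return doc.getElementsByTagName("*").getLength();
    }

    public static void main(String[] argv) throws Exception {
        System.out.println(countElements("<html><body><p>text</p></body></html>"));
    }
}
```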

Using this method will cause every Xerces2 parser constructed (using standard APIs) in the same JVM to use the HTML parser configuration. If this is not what you want, create the NekoHTML parser explicitly; you can still parse and access the document contents using standard XML APIs. The following sections describe this method in more detail.

Note: The nekohtmlXni.jar file is no longer built by default. This change was made to alleviate confusion about which Jar files to add to the JVM classpath. If you still want to use this Jar file, you must build it using the "jar-xni" Ant target.

Convenience Parser Classes

If you don't want to override the default Xerces2 parser instantiation mechanism, separate DOM and SAX parser classes are included in the org.cyberneko.html.parsers package for convenience. Both parsers use the HTMLConfiguration class to be able to parse HTML documents. In addition, the DOM parser uses the Xerces HTML DOM implementation so that the returned documents are of type org.w3c.dom.html.HTMLDocument. The following example shows how to use the NekoHTML DOMParser directly:

package sample;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;

public class TestHTMLDOM {
    // Parses each file named on the command line and prints its DOM tree.
    public static void main(String[] argv) throws Exception {
        DOMParser parser = new DOMParser();
        for (int i = 0; i < argv.length; i++) {
            parser.parse(argv[i]);
            print(parser.getDocument(), "");
        }
    }
    // Prints the implementation class of the node, then recurses into its
    // children with one more space of indentation per level.
    public static void print(Node node, String indent) {
        System.out.println(indent+node.getClass().getName());
        Node child = node.getFirstChild();
        while (child != null) {
            print(child, indent+" ");
            child = child.getNextSibling();
        }
    }
}

Running this program produces the following output: [Note: The classpath should be contiguous. It is split among separate lines in this example to make it easier to read.]

> java -cp nekohtml.jar;nekohtmlSamples.jar;
           xmlParserAPIs.jar;xercesImpl.jar
       sample.TestHTMLDOM data/test01.html
org.apache.html.dom.HTMLDocumentImpl
 org.apache.html.dom.HTMLHtmlElementImpl
  org.apache.html.dom.HTMLBodyElementImpl
   org.apache.xerces.dom.TextImpl

This source code is included in the src/sample/ directory.
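
The SAX parser class can be driven in a similar way through the standard SAX interfaces. The following is a sketch rather than code shipped with NekoHTML: CountingHandler and its helper methods are hypothetical names, and the handler is run against the JDK's own SAX parser so that the example works without extra jars. It assumes that an instance of org.cyberneko.html.parsers.SAXParser, which is an org.xml.sax.XMLReader, can be passed to count() in place of the stand-in reader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration: counts elements and attributes.
final class CountingHandler extends DefaultHandler {
    int elements;
    int attributes;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        elements++;
        attributes += attrs.getLength();
    }

    // Feeds the markup to the given reader and returns the filled-in handler.
    static CountingHandler count(XMLReader reader, String markup) throws Exception {
        CountingHandler handler = new CountingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler;
    }

    // Stand-in reader from the JDK so the sketch runs on its own.
    // Assumption: with NekoHTML on the classpath,
    // new org.cyberneko.html.parsers.SAXParser() could be used here instead.
    static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] argv) throws Exception {
        CountingHandler h = count(jdkReader(),
            "<html><body><p class=\"a\">text</p></body></html>");
        System.out.println(h.elements + " elems, " + h.attributes + " attrs");
    }
}
```

Keeping the handler independent of the parser class means the same counting logic works whether the events come from an XML parser or from the HTML parser.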

In addition to the provided DOM and SAX parser classes, NekoHTML also provides a DOM fragment parser class. The DOMFragmentParser class, found in the org.cyberneko.html.parsers package, can be used to parse fragments of HTML documents into their corresponding DOM nodes. The following example shows how to use the NekoHTML DOMFragmentParser directly:

package sample;

import org.cyberneko.html.parsers.DOMFragmentParser;
import org.apache.html.dom.HTMLDocumentImpl;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.html.HTMLDocument;

public class TestHTMLDOMFragment {
    // Parses each file named on the command line into a new fragment
    // and prints the resulting tree.
    public static void main(String[] argv) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();
        // The HTML document acts as the factory for the parsed nodes.
        HTMLDocument document = new HTMLDocumentImpl();
        for (int i = 0; i < argv.length; i++) {
            DocumentFragment fragment = document.createDocumentFragment();
            parser.parse(argv[i], fragment);
            print(fragment, "");
        }
    }
    // Prints the implementation class of the node, then recurses into its
    // children with one more space of indentation per level.
    public static void print(Node node, String indent) {
        System.out.println(indent+node.getClass().getName());
        Node child = node.getFirstChild();
        while (child != null) {
            print(child, indent+" ");
            child = child.getNextSibling();
        }
    }
}

This source code is included in the src/sample/ directory.

Notice that the application parses a document fragment a little differently from a complete document. Instead of initiating a parse with only a system identifier (or an input source), the application must also pass a DOM DocumentFragment object to the parse method. The DOM fragment parser uses the owner document of the DocumentFragment as the factory for parsed nodes. These nodes are then appended in document order to the document fragment object.

Note: In order for HTML DOM objects to be created, the document fragment object passed to the parse method should be created from a DOM document object of type org.w3c.dom.html.HTMLDocument.
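
The relationship between a fragment and its owner document can be illustrated with plain DOM calls, independent of NekoHTML. The following sketch builds a fragment by hand in the way described above: nodes are created by the owner document and appended in order. The class name FragmentOwnerSketch and the tag names are made up for illustration, and the document comes from the JDK's JAXP implementation rather than from HTMLDocumentImpl, so the nodes are generic DOM elements and not HTML DOM objects.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Hypothetical class name, used for illustration only.
final class FragmentOwnerSketch {

    // Creates a fragment from the given document and fills it with empty
    // elements. Each element is created by the fragment's owner document,
    // then appended to the fragment in the order given.
    static DocumentFragment build(Document document, String... tagNames) {
        DocumentFragment fragment = document.createDocumentFragment();
        Document factory = fragment.getOwnerDocument();
        for (String name : tagNames) {
            fragment.appendChild(factory.createElement(name));
        }
        return fragment;
    }

    public static void main(String[] argv) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        DocumentFragment fragment = build(document, "p", "ul");
        for (Node n = fragment.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName() + " created by owner document: "
                + (n.getOwnerDocument() == document));
        }
    }
}
```

Because the owner document decides which node classes are created, the type of document used to create the fragment determines the type of nodes the fragment ends up holding.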

Custom Parser Classes

Alternatively, you can construct any XNI-based parser class using the HTMLConfiguration parser configuration class found in the org.cyberneko.html package. The following example shows how to extend the abstract SAX parser provided with the Xerces2 implementation by passing the NekoHTML parser configuration to the base class in the constructor.

package sample;

import org.apache.xerces.parsers.AbstractSAXParser;
import org.cyberneko.html.HTMLConfiguration;

public class HTMLSAXParser extends AbstractSAXParser {
    public HTMLSAXParser() {
        super(new HTMLConfiguration());
    }
}

This source code is included in the src/sample/ directory.