NekoHTML is designed to be as lightweight and simple to use as possible. Using the Xerces 2.0.0 parser as a foundation, NekoHTML can be transparent for applications that instantiate parser objects with the Java API for XML Processing (JAXP). Just put the appropriate NekoHTML jar files in the classpath before the Xerces jar files. For example (on Windows): [Note: The classpath should be contiguous. It is split among separate lines in this example to make it easier to read.]
    > java -cp nekohtml.jar;nekohtmlXni.jar;
               xmlParserAPIs.jar;xercesImpl.jar;xercesSamples.jar
           sax.Counter doc/index.html
    doc/index.html: 10 ms (49 elems, 21 attrs, 0 spaces, 2652 chars)
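To illustrate what "transparent" means for application code, here is a minimal sketch of a JAXP-based element counter, loosely modelled on what the sax.Counter sample reports. The class name JaxpElementCounter and its countElements helper are hypothetical and not part of NekoHTML or Xerces. The code contains no NekoHTML imports: which parser configuration is used depends only on the classpath, assuming the Xerces2 jars are the JAXP implementation that gets resolved. The parser built into current JDKs is an internal copy and is not expected to pick up the NekoHTML configuration this way.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sample: counts elements using whatever parser JAXP returns. */
public class JaxpElementCounter {

    /** Parses the source and returns the number of start-element events seen. */
    static int countElements(InputSource source) {
        final int[] count = {0};
        try {
            // No parser class is named here; JAXP decides which one to create.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    count[0]++;
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] argv) {
        for (String systemId : argv) {
            int elems = countElements(new InputSource(systemId));
            System.out.println(systemId + ": " + elems + " elems");
        }
    }
}
```

Because the application never names a parser class, the same compiled code can be pointed at XML or HTML input purely by changing the classpath.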
The Xerces2 implementation uses the Jar service facility to dynamically
instantiate the default parser configuration when it constructs parser
objects. The Jar file nekohtmlXni.jar contains a META-INF/services file
that the Xerces2 implementation reads for this purpose. Therefore, as
long as this Jar file appears before the Xerces2 Jar files in the
classpath, the NekoHTML parser configuration will be used instead of the
Xerces2 standard configuration.
Using this method will cause every Xerces2 parser constructed (using standard APIs) in the same JVM to use the HTML parser configuration. If this is not what you want to do, you should create the NekoHTML parser explicitly even though you parse and access the document contents using standard XML APIs. The following sections describe this method in more detail.
Note: The nekohtmlXni.jar file is no longer built by default. This change was made to alleviate confusion about which Jar files to add to the JVM classpath. If you still want to use this Jar file, you must build it using the "jar-xni" Ant target.
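When it is unclear whether the services entry described above is actually visible on the classpath, a small diagnostic can help. The following is a sketch only: the class name ParserConfigLookup is hypothetical, and the services file name is an assumption (that the entry is named after the XNI XMLParserConfiguration interface), which should be checked against the contents of your nekohtmlXni.jar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic: lists parser configuration service entries. */
public class ParserConfigLookup {

    // Assumption: the entry is named after the XNI parser configuration interface.
    static final String SERVICE_FILE =
        "META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration";

    /** Returns the first line of each matching services file, in the order the loader reports them. */
    static List<String> configuredClasses(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(SERVICE_FILE);
            while (urls.hasMoreElements()) {
                URL url = urls.nextElement();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line = in.readLine();
                    if (line != null && !line.trim().isEmpty()) {
                        names.add(line.trim() + "  (" + url + ")");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] argv) {
        List<String> found = configuredClasses(ClassLoader.getSystemClassLoader());
        if (found.isEmpty()) {
            System.out.println("No parser configuration services entry found on the classpath.");
        }
        for (String entry : found) {
            System.out.println(entry);
        }
    }
}
```

Run it with the same -cp value as the application. If the NekoHTML entry is missing or is not listed first, the classpath order is the first thing to check.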
If you don't want to override the default Xerces2 parser
instantiation mechanism, separate DOM and SAX parser classes are
included in the org.cyberneko.html.parsers package for convenience.
Both parsers use the HTMLConfiguration class to parse HTML documents.
In addition, the DOM parser uses the Xerces HTML DOM implementation so
that the returned documents are of type org.w3c.dom.html.HTMLDocument.
The following example shows how to use the NekoHTML DOMParser directly:
    package sample;

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class TestHTMLDOM {

        public static void main(String[] argv) throws Exception {
            DOMParser parser = new DOMParser();
            // Parse each file named on the command line and dump its DOM tree.
            for (int i = 0; i < argv.length; i++) {
                parser.parse(argv[i]);
                print(parser.getDocument(), "");
            }
        }

        // Prints the implementation class of each node, indented by depth.
        public static void print(Node node, String indent) {
            System.out.println(indent+node.getClass().getName());
            Node child = node.getFirstChild();
            while (child != null) {
                print(child, indent+" ");
                child = child.getNextSibling();
            }
        }
    }
Running this program produces the following output: [Note: The classpath should be contiguous. It is split among separate lines in this example to make it easier to read.]
    > java -cp nekohtml.jar;nekohtmlSamples.jar;
               xmlParserAPIs.jar;xercesImpl.jar
           sample.TestHTMLDOM data/test01.html
    org.apache.html.dom.HTMLDocumentImpl
     org.apache.html.dom.HTMLHtmlElementImpl
      org.apache.html.dom.HTMLBodyElementImpl
       org.apache.xerces.dom.TextImpl
This source code is included in the src/sample/
directory.
In addition to the provided DOM and SAX parser classes, NekoHTML
also provides a DOM fragment parser class. The DOMFragmentParser
class, found in the org.cyberneko.html.parsers package, can be used
to parse fragments of HTML documents into their corresponding DOM
nodes. The following example shows how to use the NekoHTML
DOMFragmentParser directly:
    package sample;

    import org.cyberneko.html.parsers.DOMFragmentParser;
    import org.apache.html.dom.HTMLDocumentImpl;
    import org.w3c.dom.Document;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;
    import org.w3c.dom.html.HTMLDocument;

    public class TestHTMLDOMFragment {

        public static void main(String[] argv) throws Exception {
            DOMFragmentParser parser = new DOMFragmentParser();
            // The HTML document acts as the factory for the parsed nodes.
            HTMLDocument document = new HTMLDocumentImpl();
            for (int i = 0; i < argv.length; i++) {
                DocumentFragment fragment = document.createDocumentFragment();
                parser.parse(argv[i], fragment);
                print(fragment, "");
            }
        }

        // Prints the implementation class of each node, indented by depth.
        public static void print(Node node, String indent) {
            System.out.println(indent+node.getClass().getName());
            Node child = node.getFirstChild();
            while (child != null) {
                print(child, indent+" ");
                child = child.getNextSibling();
            }
        }
    }
This source code is included in the src/sample/
directory.
Notice that the application parses a document fragment slightly
differently than it parses a complete document. In addition to the
system identifier (or input source), the application must pass a DOM
DocumentFragment object to the parse method. The DOM fragment parser
uses the owner document of the DocumentFragment as the factory for
parsed nodes. These nodes are then appended in document order to the
document fragment object.
Note: In order for HTML DOM objects to be created, the document
fragment object passed to the parse method should be created from a
DOM document object of type org.w3c.dom.html.HTMLDocument.
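The role of the owner document as node factory can be illustrated with the standard DOM API alone. The sketch below is not the DOMFragmentParser implementation. The class FragmentOwnerSketch and its methods are hypothetical, and a generic JAXP document stands in for an HTMLDocument. It mimics what the text describes: nodes are created by the fragment's owner document and appended to the fragment in document order.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

/** Hypothetical sample: shows a fragment's owner document acting as node factory. */
public class FragmentOwnerSketch {

    /** Appends one element per name, each created by the fragment's owner document. */
    static void fill(DocumentFragment fragment, String... elementNames) {
        Document factory = fragment.getOwnerDocument();
        for (String name : elementNames) {
            // Children are appended in the order they are encountered.
            fragment.appendChild(factory.createElement(name));
        }
    }

    /** Creates an empty generic DOM document (a stand-in for an HTMLDocument). */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] argv) {
        Document document = newDocument();
        DocumentFragment fragment = document.createDocumentFragment();
        fill(fragment, "p", "ul");
        for (Node n = fragment.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName() + " " + n.getClass().getName());
        }
    }
}
```

Since the document here is a generic one, the printed classes are generic element implementations rather than HTML DOM classes. This is the situation the note above warns about and the reason the fragment should come from an HTMLDocument.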
Alternatively, you can construct any XNI-based parser class using
the HTMLConfiguration parser configuration class found in the
org.cyberneko.html package. The following example shows how to extend
the abstract SAX parser provided with the Xerces2 implementation by
passing the NekoHTML parser configuration to the base class in the
constructor:
    package sample;

    import org.apache.xerces.parsers.AbstractSAXParser;
    import org.cyberneko.html.HTMLConfiguration;

    public class HTMLSAXParser extends AbstractSAXParser {

        public HTMLSAXParser() {
            super(new HTMLConfiguration());
        }
    }
This source code is included in the src/sample/
directory.