The Xerces Native Interface (XNI) defines a parser configuration framework in which parsers can be written as a pipeline of modular components. This allows new parser configurations to be constructed by re-arranging existing components and/or writing custom components. And because the NekoHTML parser is written using this modular framework, new functionality can be quickly and easily added to the parser by appending custom document filters to the end of the default NekoHTML parsing pipeline.
To write a custom filter, write a new class that implements the XMLDocumentFilter interface from the org.apache.xerces.xni.parser package of Xerces2. This interface allows the component to be both the handler of document events from the previous stage in the pipeline and the source of events for the next stage. The implementation of the new filter is completely arbitrary; it can remove events from the document stream, generate new events, or do anything else you want!
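The "handler and source" role described above can be illustrated with a toy model. This is only a sketch: the Handler, Filter, UpperCaseText, and Recorder types below are hypothetical stand-ins written for this example and are not the XNI interfaces, which define many more callbacks and richer argument types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PipelineSketch {
    /** Hypothetical stand-in for a document handler: it receives events. */
    interface Handler {
        void startElement(String name);
        void characters(String text);
        void endElement(String name);
    }

    /** Hypothetical stand-in for a filter: a handler that is also a source. */
    static class Filter implements Handler {
        private Handler next;

        void setHandler(Handler next) { this.next = next; }

        // Default behavior: forward every event unchanged to the next stage.
        @Override public void startElement(String name) {
            if (next != null) next.startElement(name);
        }
        @Override public void characters(String text) {
            if (next != null) next.characters(text);
        }
        @Override public void endElement(String name) {
            if (next != null) next.endElement(name);
        }
    }

    /** Overrides one callback to modify events on their way through. */
    static class UpperCaseText extends Filter {
        @Override public void characters(String text) {
            super.characters(text.toUpperCase(Locale.ROOT));
        }
    }

    /** Final stage: records the events it receives. */
    static class Recorder implements Handler {
        final List<String> events = new ArrayList<>();
        @Override public void startElement(String name) { events.add("start:" + name); }
        @Override public void characters(String text) { events.add("text:" + text); }
        @Override public void endElement(String name) { events.add("end:" + name); }
    }

    /** Feeds three events through the filter into the recorder. */
    static List<String> run() {
        UpperCaseText filter = new UpperCaseText();
        Recorder recorder = new Recorder();
        filter.setHandler(recorder);
        filter.startElement("p");
        filter.characters("hello");
        filter.endElement("p");
        return recorder.events;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [start:p, text:HELLO, end:p]
    }
}
```

A real filter follows the same shape: it overrides only the callbacks it cares about and forwards everything else to the next stage.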
NekoHTML includes a base filter class to simplify the creation of custom filters. To write a new filter, extend the DefaultFilter class located in the org.cyberneko.html.filters package and override the relevant methods to add your own behavior. Once done, the only thing you need to do is append the filter to the end of the parser pipeline.
The NekoHTML parser has a filters property that allows you to append custom document filters to the end of the default parser pipeline. The value of this property is an array of objects that implement the XMLDocumentFilter interface in XNI. For example, the following code instantiates a default filter and appends it to the parser pipeline:
XMLDocumentFilter noop = new DefaultFilter();
XMLDocumentFilter[] filters = { noop };
XMLParserConfiguration parser = new HTMLConfiguration();
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
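This section describes a few of the basic document filters that are included with the NekoHTML parser. The included filters enable applications to serialize HTML documents, perform namespace processing, produce well-formed XML output, remove elements from the event stream, and recover the event stream originally generated by the HTML scanner.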
NekoHTML includes a simple HTML serializer written as a filter. The Writer class is located in the org.cyberneko.html.filters package and contains two different constructors. The default constructor creates a writer that prints to the standard output. The other constructor allows the application to control the output stream and the encoding. For example:
// write to standard output using UTF-8
XMLDocumentFilter writer = new Writer();

// write to file with specified encoding
// (named fileWriter so that both declarations can share one scope)
OutputStream stream = new FileOutputStream("index.html");
String encoding = "ISO-8859-1";
XMLDocumentFilter fileWriter = new Writer(stream, encoding);
Besides serializing the HTML event stream, the writer also passes the document events to the next stage in the pipeline. This allows applications to insert writer filters between other custom filters for debugging purposes.
Since an HTML document may have specified its encoding using the <META> tag and http-equiv/content attributes, the writer will automatically change any character set specified in this tag to match the encoding of the output stream. Therefore, the character encoding name used to construct the writer should be an official IANA encoding name and not a Java encoding name. Note: The modified character set in the <META> tag is not propagated to the next stage in the pipeline. The changed value is only output to the stream; the original value is sent to the next stage in the pipeline.
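If an application only has a Java-style encoding name at hand, one possible way to obtain a name suitable for the writer is to ask java.nio.charset.Charset for the canonical name. The Charset documentation states that a supported charset listed in the IANA registry uses the registry's MIME-preferred name as its canonical name. The class and method names below are hypothetical and exist only for this sketch; they are not part of NekoHTML.

```java
import java.nio.charset.Charset;

final class EncodingNameSketch {
    /**
     * Maps any alias the JVM recognizes (for example a legacy java.io
     * name) to the canonical java.nio charset name. Throws an unchecked
     * exception if the name is illegal or the charset is unsupported.
     */
    static String canonicalName(String encoding) {
        return Charset.forName(encoding).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("ISO8859_1")); // prints ISO-8859-1
        System.out.println(canonicalName("UTF8"));      // prints UTF-8
    }
}
```

The returned string can then be passed as the encoding argument when constructing the writer.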
For convenience, the Writer class contains a main method that allows you to run it as a program. This can be used for debugging, to see what the NekoHTML parser is generating, and for converting the character encoding of existing documents.
The following table shows the standard usage of the writer:
Usage: java org.cyberneko.html.filters.Writer (options) file ...

Option | Description |
---|---|
-e name | Specify IANA name of output encoding. |
-i | Perform identity transform. |
-p | Purify output to ensure XML well-formedness. |
-h | Display help screen. |
For convenience, NekoHTML includes a filter that performs namespace processing. You do not need to add this filter manually because it is automatically added to the parsing pipeline if the SAX namespaces feature is enabled. However, if you are interested, the NamespaceBinder component is included in the org.cyberneko.html.filters package.
Note: This component does not perform any namespace processing unless the SAX namespaces feature, "http://xml.org/sax/features/namespaces", is enabled.
HTML allows documents to be less strict than XML documents, which means that most HTML documents cannot be parsed with an XML parser. But even if an HTML document can be parsed and accessed by applications using standard XML programming interfaces, many applications need to produce well-formed output. Not only do tags need to be balanced properly, but the document content must also be legal according to the XML specification. Therefore, the NekoHTML parser provides a filter that "purifies" the input, ensuring that the output is well-formed XML.
The Purifier class in the org.cyberneko.html.filters package lets the application convert the HTML input into well-formed XML output. Some of the changes that the Purifier performs are:
The NekoHTML parser also provides a basic document filter capable of removing specified elements from the processing stream. The ElementRemover class is located in the org.cyberneko.html.filters package and provides two options for processing document elements:
The first option allows the application to specify which elements appearing in the event stream should be accepted and, therefore, passed on to the next stage in the pipeline. All elements not in the list of acceptable elements have their start and end tags stripped from the event stream unless those elements appear in the list of elements to be removed.
The second option allows the application to specify which elements should be completely removed from the event stream. When an element appears that is to be removed, the element's start and end tags as well as all of that element's content are removed from the event stream.
A common use of this filter would be to only allow rich-text and linking elements as well as the character content to pass through the filter — all other elements would be stripped. The following code shows how to configure this filter to perform this task:
ElementRemover remover = new ElementRemover();
remover.acceptElement("b", null);
remover.acceptElement("i", null);
remover.acceptElement("u", null);
remover.acceptElement("a", new String[] { "href" });
However, this would still allow the text content of other elements to pass through, which may not be desirable. In order to further "clean" the input, the removeElement option can be used. The following piece of code adds the ability to completely remove any <SCRIPT> tags and content from the stream.
remover.removeElement("script");
This source code is included in the src/sample/ directory in the file named RemoveElements.java.
Note: When an element is "stripped", its start and end tags are removed from the event stream. However, the element's text content and any accepted child elements are not stripped. To completely remove an element's content, use the removeElement method.
Note: Care should be taken when using this filter because the output may not be a well-balanced tree. Specifically, if the application removes the <HTML> element (with or without retaining its children), the resulting document event stream will no longer be well-formed.
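The difference between accepting, stripping, and removing can be seen in a small model. This is a simplified sketch and not the ElementRemover implementation: it works on plain strings instead of XNI events, ignores attribute filtering, assumes well-balanced input, and assumes that removal is checked before acceptance. The class name ElementRemoverModel is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ElementRemoverModel {
    private final Set<String> accepted = new HashSet<>();
    private final Set<String> removed = new HashSet<>();

    void acceptElement(String name) { accepted.add(name.toLowerCase(Locale.ROOT)); }
    void removeElement(String name) { removed.add(name.toLowerCase(Locale.ROOT)); }

    /** Events are "<name>" (start tag), "</name>" (end tag), or plain text. */
    List<String> filter(List<String> events) {
        List<String> out = new ArrayList<>();
        int removalDepth = 0; // greater than zero while inside a removed element
        for (String event : events) {
            boolean end = event.startsWith("</");
            boolean start = !end && event.startsWith("<");
            String name = null;
            if (start) name = event.substring(1, event.length() - 1);
            if (end) name = event.substring(2, event.length() - 1);
            if (name != null) name = name.toLowerCase(Locale.ROOT);

            if (removalDepth > 0) {
                // Inside a removed element: drop everything, track nesting.
                if (start) removalDepth++;
                if (end) removalDepth--;
                continue;
            }
            if (start && removed.contains(name)) {
                removalDepth = 1; // drop this tag and all of its content
                continue;
            }
            if (start || end) {
                // Accepted tags pass; other tags are stripped, content stays.
                if (accepted.contains(name)) out.add(event);
                continue;
            }
            out.add(event); // character content passes through
        }
        return out;
    }

    public static void main(String[] args) {
        ElementRemoverModel model = new ElementRemoverModel();
        model.acceptElement("b");
        model.removeElement("script");
        List<String> in = Arrays.asList("<div>", "<b>", "bold", "</b>",
                "<script>", "alert(1)", "</script>", "tail", "</div>");
        System.out.println(model.filter(in)); // prints [<b>, bold, </b>, tail]
    }
}
```

In the output, the div tags are stripped but their content survives, while the script element disappears together with its content.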
An identity filter is provided that reproduces the original document event stream generated by the HTML scanner by removing events that are synthesized by the tag balancer. This operation is essentially the same as turning off tag-balancing in the parser. However, this filter is useful when you want the tag balancer to report "errors" but do not want the synthesized events in the output.
Note: This filter requires the augmentations feature to be turned on. For example:
XMLParserConfiguration parser = new HTMLConfiguration();
parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
Note: This isn't exactly the identity transform because the element and attribute names may have been modified from the original document. For example, by default, NekoHTML converts element names to upper-case and attribute names to lower-case.
The NekoHTML parser has the ability to dynamically insert content into the parsed HTML document. This functionality can be used to insert the result of an embedded script (e.g. JavaScript) into the HTML document in place of the script element. Note: NekoHTML does not provide a scripting engine — only the ability to insert content to be parsed.
To insert content into the HTML document stream, call the pushInputSource method on the NekoHTML parser configuration class. This method takes an XMLInputSource object as a parameter. At the moment, the character stream (java.io.Reader) of the input source must be set or else the implementation will throw an illegal argument exception.
A sample program called Script is included in the src/sample/ directory. It demonstrates how to use the pushInputSource method of the HTMLConfiguration to dynamically insert content into the HTML stream.
This sample defines a new script language called "NekoScript" that is a modified subset of the NSGMLS format. In this format, each line specifies a new command where each command may indicate a start element tag, an attribute value, character content, an end element tag, etc. The following table enumerates the NSGMLS features supported by the NekoScript language:
Command | Description |
---|---|
(name | A start element with the specified name. |
"text | Character content with the specified text. |
)name | An end element with the specified name. |
When processed with the Script filter, the following document:
<script type='text/x-nekoscript'>
(h1
"Header
)h1
</script>
is equivalent to:
<H1>Header</H1>
as seen by the document handler registered with the parser.
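The translation step for the three commands in the table can be sketched as follows. This is an illustration written for this section, not the code of the Script sample: the class and method names are hypothetical, each line is trimmed, and text is not escaped. In a real filter, the resulting string would be wrapped in a java.io.Reader, placed in an XMLInputSource, and handed to pushInputSource so that the parser scans it as HTML.

```java
final class NekoScriptSketch {
    /** Translates NekoScript commands, one per line, into HTML markup. */
    static String translate(String script) {
        StringBuilder out = new StringBuilder();
        for (String raw : script.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            char command = line.charAt(0);
            String arg = line.substring(1);
            switch (command) {
                case '(': // start element
                    out.append('<').append(arg).append('>');
                    break;
                case '"': // character content, appended without escaping
                    out.append(arg);
                    break;
                case ')': // end element
                    out.append("</").append(arg).append('>');
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported command: " + line);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <h1>Header</h1>
        System.out.println(translate("(h1\n\"Header\n)h1\n"));
    }
}
```

The sketch emits the element name as written in the script; the upper-case H1 shown above comes from the parser's default handling of element names once the inserted content is parsed.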
The Script class implements a main method so that it can be run as a program. Running the program produces the following output: [Note: The command should be contiguous. It is split among separate lines in this example to make it easier to read.]
> java -cp nekohtml.jar;nekohtmlSamples.jar;lib/xercesMinimal.jar
       sample.Script data/test33.html
<H1>Header</H1>