Pipeline Filters

Overview
- Creating a New Filter
- Appending Filters to the Pipeline
Sample Filters

Overview

The Xerces Native Interface (XNI) defines a parser configuration framework in which parsers can be written as a pipeline of modular components. This allows new parser configurations to be constructed by re-arranging existing components and/or writing custom components. And because the NekoHTML parser is written using this modular framework, new functionality can be quickly and easily added to the parser by appending custom document filters to the end of the default NekoHTML parsing pipeline.

Creating a New Filter

To write a custom filter, write a new class that implements the XMLDocumentFilter interface from the org.apache.xerces.xni.parser package of Xerces2. This interface allows the component to be both the handler of document events from the previous stage in the pipeline and the source of events for the next stage. The implementation of the new filter is completely arbitrary; it can remove events from the document stream, generate new events, or do anything else you want.

NekoHTML includes a base filter class to simplify the creation of custom filters. To write a new filter, simply extend the DefaultFilter class located in the org.cyberneko.html.filters package and override the relevant methods to add your own behavior. Once done, the only thing you need to do is append the filter to the end of the parser pipeline.
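The forward-by-default pattern described above can be sketched with the standard library alone. The types below (EventHandler, PassThroughFilter, UpperCaseTextFilter, RecordingHandler) are hypothetical stand-ins invented for this illustration. They are not the XNI or NekoHTML classes, and the real XMLDocumentFilter interface has many more callbacks with different signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, much-reduced stand-in for a document handler.
interface EventHandler {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

// Plays the role a base filter plays: it is a handler for the previous
// stage and a source for the next one, forwarding every event unchanged.
class PassThroughFilter implements EventHandler {
    private EventHandler next;

    void setNext(EventHandler next) { this.next = next; }

    public void startElement(String name) { if (next != null) next.startElement(name); }
    public void characters(String text)   { if (next != null) next.characters(text); }
    public void endElement(String name)   { if (next != null) next.endElement(name); }
}

// A custom filter overrides only the callbacks it cares about.
class UpperCaseTextFilter extends PassThroughFilter {
    @Override
    public void characters(String text) {
        super.characters(text.toUpperCase(Locale.ROOT));
    }
}

// Terminal stage that records the events it receives.
class RecordingHandler implements EventHandler {
    final List<String> events = new ArrayList<>();

    public void startElement(String name) { events.add("start:" + name); }
    public void characters(String text)   { events.add("text:" + text); }
    public void endElement(String name)   { events.add("end:" + name); }
}
```

Because every callback that is not overridden simply forwards its event, a subclass only has to contain the behavior that makes it different from a no-op stage.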

Appending Filters to the Pipeline

The NekoHTML parser has a filters property that allows you to append custom document filters to the end of the default parser pipeline. The value of this property is an array of objects that implement the XMLDocumentFilter interface in XNI. For example, the following code instantiates a default filter and appends it to the parser pipeline:

XMLDocumentFilter noop = new DefaultFilter();
XMLDocumentFilter[] filters = { noop };

XMLParserConfiguration parser = new HTMLConfiguration();
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

Sample Filters

This section describes a few of the basic document filters that are included with the NekoHTML parser. The included filters enable applications to perform a variety of operations, including:

serializing HTML documents;
ensuring XML well-formedness; and
performing identity transform.

Serializing HTML Documents

NekoHTML includes a simple HTML serializer written as a filter. The Writer class is located in the org.cyberneko.html.filters package and has two constructors. The default constructor creates a writer that prints to standard output. The other constructor lets the application control the output stream and the encoding. For example:

// write to standard output using UTF-8
XMLDocumentFilter stdoutWriter = new Writer();

// write to file with specified encoding
OutputStream stream = new FileOutputStream("index.html");
String encoding = "ISO-8859-1";
XMLDocumentFilter fileWriter = new Writer(stream, encoding);

Besides serializing the HTML event stream, the writer also passes the document events to the next stage in the pipeline. This allows applications to insert writer filters between other custom filters for debugging purposes.

Since an HTML document may have specified its encoding using the <META> tag and http-equiv/content attributes, the writer will automatically change any character set specified in this tag to match the encoding of the output stream. Therefore, the character encoding name used to construct the writer should be an official IANA encoding name and not a Java encoding name. Note: The modified character set in the <META> tag is not propagated to the next stage in the pipeline. The changed value is only output to the stream; the original value is sent to the next stage in the pipeline.
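If an application only has a historical Java encoding name at hand (for example "UTF8" or "ISO8859_1"), one way to obtain a registry-style name is to resolve it through java.nio.charset.Charset, whose canonical names follow the IANA registry for registered charsets. The helper below is a minimal sketch. The class and method names are made up for this example, and whether the resulting name is accepted by the Writer constructor has not been verified here.

```java
import java.nio.charset.Charset;

final class EncodingNames {
    private EncodingNames() {}

    // Resolves any name the JVM recognises (canonical name or alias)
    // to the canonical name of the charset.
    // Throws an unchecked exception if the name is illegal or unsupported.
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }
}
```

For instance, EncodingNames.canonicalName("ISO8859_1") yields "ISO-8859-1".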

For convenience, the Writer class contains a main method that allows you to run it as a program. This can be used for debugging purposes in order to see what the NekoHTML parser is generating as well as converting the character encoding of existing documents.

The following table shows the standard usage of the writer:

Usage: java org.cyberneko.html.filters.Writer (options) file ...
Options:
  -e name  Specify IANA name of output encoding.
  -i       Perform identity transform.
  -p       Purify output to ensure XML well-formedness.
  -h       Display help screen.

Namespace Processing

A filter to perform namespace processing is included with NekoHTML, for convenience. You do not need to add this filter manually because it is automatically added to the parsing pipeline if the SAX namespaces feature is enabled. However, if you are interested, the NamespaceBinder component is included in the org.cyberneko.html.filters package.

Note: This component does not perform any namespace processing unless the SAX namespaces feature, "http://xml.org/sax/features/namespaces", is enabled.

Ensuring XML Well-Formedness

HTML is less strict than XML, which means that most HTML documents cannot be parsed with an XML parser. But even if an HTML document can be parsed and accessed by applications using standard XML programming interfaces, many applications need to produce well-formed output. Not only do tags need to be balanced properly, but the document content must also be legal according to the XML specification. Therefore, the NekoHTML parser provides a filter that "purifies" the input, ensuring that the output is well-formed XML.

The Purifier class in the org.cyberneko.html.filters package lets the application convert the HTML input into well-formed XML output. Some of the changes that the Purifier performs are:

fixing illegal element and attribute names;
ensuring the string "--" does not appear in the content of a comment;
escaping illegal characters appearing in the document;
etc.
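To make the comment rule concrete: XML forbids the string "--" inside a comment and also forbids a comment body that ends with "-". The following sketch shows one possible repair, which is to separate adjacent hyphens with a space. It is an illustration only. The class and method names are hypothetical, and the replacement strategy is an assumption, not necessarily what the Purifier class does.

```java
final class CommentSketch {
    private CommentSketch() {}

    // Inserts a space between adjacent hyphens and after a trailing hyphen,
    // so that the text can be placed between "<!--" and "-->" in XML.
    static String makeCommentSafe(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean previousIsHyphen = out.length() > 0 && out.charAt(out.length() - 1) == '-';
            if (c == '-' && previousIsHyphen) {
                out.append(' ');
            }
            out.append(c);
        }
        // A comment body may not end with '-' either.
        if (out.length() > 0 && out.charAt(out.length() - 1) == '-') {
            out.append(' ');
        }
        return out.toString();
    }
}
```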

Removing Elements

The NekoHTML parser also provides a basic document filter capable of removing specified elements from the processing stream. The ElementRemover class is located in the org.cyberneko.html.filters package and provides two options for processing document elements:

specifying those elements which should be accepted and, optionally, which attributes of that element should be kept; and
specifying those elements whose tags and content should be completely removed from the event stream.

The first option allows the application to specify which elements appearing in the event stream should be accepted and, therefore, passed on to the next stage in the pipeline. All elements not in the list of acceptable elements have their start and end tags stripped from the event stream unless those elements appear in the list of elements to be removed.

The second option allows the application to specify which elements should be completely removed from the event stream. When an element appears that is to be removed, the element's start and end tag as well as all of that element's content is removed from the event stream.

A common use of this filter would be to only allow rich-text and linking elements as well as the character content to pass through the filter — all other elements would be stripped. The following code shows how to configure this filter to perform this task:

ElementRemover remover = new ElementRemover();
remover.acceptElement("b", null);
remover.acceptElement("i", null);
remover.acceptElement("u", null);
remover.acceptElement("a", new String[] { "href" });

However, this would still allow the text content of other elements to pass through, which may not be desirable. In order to further "clean" the input, the removeElement option can be used. The following piece of code adds the ability to completely remove any <SCRIPT> tags and content from the stream.

remover.removeElement("script");

This source code is included in the src/sample/ directory in the file named RemoveElements.java.

Note: When an element is "stripped", its start and end tags are removed from the event stream. However, all of the element's text content and elements (that are accepted) are not stripped. To completely remove an element's content, use the removeElement method.

Note: Care should be taken when using this filter because the output may not be a well-balanced tree. Specifically, if the application removes the <HTML> element (with or without retaining its children), the resulting document event stream will no longer be well-formed.
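The difference between stripping and removing can be modelled on a flat list of events. The sketch below is a standard-library-only illustration of the rules described in this section. It is not the ElementRemover source: the RemoverSketch class and the string encoding of events ("<name>" for a start tag, "</name>" for an end tag, anything else for text) are assumptions made for the example, and attributes are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RemoverSketch {
    private RemoverSketch() {}

    static List<String> filter(List<String> events, Set<String> accepted, Set<String> removed) {
        List<String> out = new ArrayList<>();
        int removalDepth = 0; // > 0 while inside an element that is being removed
        for (String event : events) {
            boolean end = event.startsWith("</");
            boolean start = !end && event.startsWith("<");
            if (!start && !end) {
                // Character content survives unless it is inside a removed element.
                if (removalDepth == 0) out.add(event);
                continue;
            }
            String name = event.substring(end ? 2 : 1, event.length() - 1);
            if (removalDepth > 0) {
                // Everything nested in a removed element is dropped.
                removalDepth += start ? 1 : -1;
                continue;
            }
            if (removed.contains(name)) {
                if (start) removalDepth = 1;
                continue;
            }
            // Accepted elements pass through; all others are stripped,
            // meaning that only their tags are dropped.
            if (accepted.contains(name)) out.add(event);
        }
        return out;
    }
}
```

With "b" accepted and "script" removed, the events of `<div><b>bold</b><script>alert(1)</script>tail</div>` reduce to the b element, its text, and the trailing text: the div tags are stripped while the script element disappears together with its content.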

Performing Identity Transform

An identity filter is provided that performs an identity operation on the original document event stream generated by the HTML scanner by removing the events that are synthesized by the tag balancer. The result is essentially the same as turning off tag-balancing in the parser. However, this filter is useful when you want the tag balancer to report "errors" but do not want the synthesized events in the output.

Note: This filter requires the augmentations feature to be turned on. For example:

XMLParserConfiguration parser = new HTMLConfiguration();
parser.setFeature("http://cyberneko.org/html/features/augmentations", true);

Note: This isn't exactly the identity transform because the element and attribute names may have been modified from the original document. For example, by default, NekoHTML converts element names to upper-case and attribute names to lower-case.

Dynamically Inserting Content

The NekoHTML parser has the ability to dynamically insert content into the parsed HTML document. This functionality can be used to insert the result of an embedded script (e.g. JavaScript) into the HTML document in place of the script element. Note: NekoHTML does not provide a scripting engine — only the ability to insert content to be parsed.

To insert content into the HTML document stream, call the pushInputSource method on the NekoHTML parser configuration class. This method takes an XMLInputSource object as a parameter. At the moment, the character stream (java.io.Reader) of the input source must be set or else the implementation will throw an illegal argument exception.

A sample program called Script is included in the src/sample/ directory that demonstrates how to use the pushInputSource method of the HTMLConfiguration in order to dynamically insert content into the HTML stream. This sample defines a new script language called "NekoScript" that is a modified subset of the NSGMLS format. In this format, each line specifies a new command where each command may indicate a start element tag, an attribute value, character content, an end element tag, etc. The following table enumerates the NSGMLS features supported by the NekoScript language:

(name   A start element with the specified name.
"text   Character content with the specified text.
)name   An end element with the specified name.

When processed with the Script filter, the following document:

<script type='text/x-nekoscript'>
(h1
"Header
)h1
</script>

is equivalent to:

<H1>Header</H1>

as seen by the document handler registered with the parser.
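The translation from NekoScript commands to markup can be sketched in a few lines. The class below is a hypothetical, standard-library-only illustration of the three commands listed above. It is not the code of the Script sample. Element names are upper-cased only to mirror the output shown in this section, and character content is copied without any escaping.

```java
import java.util.Locale;

final class NekoScriptSketch {
    private NekoScriptSketch() {}

    // Converts a NekoScript program (one command per line) into markup text.
    static String toMarkup(String script) {
        StringBuilder out = new StringBuilder();
        for (String line : script.split("\\R")) {
            if (line.isEmpty()) continue;
            char command = line.charAt(0);
            String argument = line.substring(1);
            switch (command) {
                case '(': // start element
                    out.append('<').append(argument.toUpperCase(Locale.ROOT)).append('>');
                    break;
                case '"': // character content
                    out.append(argument);
                    break;
                case ')': // end element
                    out.append("</").append(argument.toUpperCase(Locale.ROOT)).append('>');
                    break;
                default:  // any other command is ignored in this sketch
                    break;
            }
        }
        return out.toString();
    }
}
```

A string produced this way is the kind of content that could be wrapped in a java.io.StringReader and handed to the parser as an input source.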

The Script class implements a main method so that it can be run as a program. Running the program produces the following output: [Note: The command should be contiguous. It is split among separate lines in this example to make it easier to read.]

> java -cp nekohtml.jar;nekohtmlSamples.jar;lib/xercesMinimal.jar 
       sample.Script data/test33.html
<H1>Header</H1>