Parser Settings


Configuring Parser

The application can set a variety of NekoHTML settings to more precisely control the behavior of the parser. These settings can be set directly on the HTMLConfiguration class or on the supplied parser classes by calling the setFeature and setProperty methods. For example:

// settings on HTMLConfiguration
org.apache.xerces.xni.parser.XMLParserConfiguration config =
  new org.cyberneko.html.HTMLConfiguration();
config.setFeature("http://cyberneko.org/html/features/augmentations", true);
config.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

// settings on DOMParser
org.cyberneko.html.parsers.DOMParser parser = 
  new org.cyberneko.html.parsers.DOMParser();
parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
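
When several settings have to be applied together, it can help to collect them first and apply them in one place. The following is a minimal sketch, not part of NekoHTML: the ParserSettings class and its method names are hypothetical, and it assumes the parser is reached through the standard org.xml.sax.XMLReader interface (which the NekoHTML SAX parser class is expected to implement). The demo in main uses the JDK's own XMLReader so that the sketch runs without extra jars.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper (not part of NekoHTML): collects feature and property
// settings and applies them to any org.xml.sax.XMLReader.
final class ParserSettings {
    private final Map<String, Boolean> features = new LinkedHashMap<>();
    private final Map<String, Object> properties = new LinkedHashMap<>();

    ParserSettings feature(String id, boolean value) {
        features.put(id, value);
        return this;
    }

    ParserSettings property(String id, Object value) {
        properties.put(id, value);
        return this;
    }

    // XMLReader.setFeature and setProperty throw SAXNotRecognizedException or
    // SAXNotSupportedException (both subclasses of SAXException) for ids the
    // reader does not know or cannot change, so the caller has to handle them.
    void applyTo(XMLReader reader) throws SAXException {
        for (Map.Entry<String, Boolean> e : features.entrySet()) {
            reader.setFeature(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, Object> e : properties.entrySet()) {
            reader.setProperty(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) throws Exception {
        // Demonstrated on the JDK's built-in XMLReader (an assumption made
        // only so the example is self-contained).
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        new ParserSettings()
            .feature("http://xml.org/sax/features/namespaces", true)
            .applyTo(reader);
        System.out.println(reader.getFeature("http://xml.org/sax/features/namespaces"));
    }
}
```

With NekoHTML on the classpath, the same applyTo call could be given a NekoHTML SAX parser instance together with the cyberneko.org feature and property ids listed on this page.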

Features

The NekoHTML parser supports the following features:
Feature Id / Description / Default
http://xml.org/sax/features/namespaces
Specifies if the NekoHTML parser should perform namespace processing. If enabled, namespace binding attributes are processed and elements and attributes are bound to the defined namespaces.

See: http://cyberneko.org/html/features/override-namespaces

true
http://cyberneko.org/html/features/balance-tags
Specifies if the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes up many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. In order to process HTML documents as XML, this feature should not be turned off. The option to turn it off is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure.
true
http://cyberneko.org/html/features/override-doctype
Specifies whether the NekoHTML parser should override the public and system identifier values specified in the document type declaration.

See: http://cyberneko.org/html/properties/doctype/pubid
See: http://cyberneko.org/html/properties/doctype/sysid

false
http://cyberneko.org/html/features/insert-doctype
Specifies whether the NekoHTML parser should insert a document type declaration into the document handler callbacks. The values for the public and system identifiers are taken from the sysid and pubid properties. Therefore, those properties should be set if this feature is turned on. Also, setting this feature to true will cause the parser to ignore any document type declaration that appears in the document.

See: http://cyberneko.org/html/properties/doctype/pubid
See: http://cyberneko.org/html/properties/doctype/sysid

false
http://cyberneko.org/html/features/override-namespaces
Specifies whether the NekoHTML parser should override the namespace URI bound to HTML elements and attributes.

See: http://cyberneko.org/html/properties/namespaces-uri

false
http://cyberneko.org/html/features/insert-namespaces
Specifies whether the NekoHTML parser should insert namespace URI bindings to HTML elements and attributes. The value for the namespace URI is taken from the namespaces-uri property. Therefore, that property should be set if this feature is turned on.

See: http://cyberneko.org/html/properties/namespaces-uri

false
http://cyberneko.org/html/features/balance-tags/ignore-outside-content
Specifies if the NekoHTML parser should ignore content after the end of the document root element. If this feature is set to true, all elements and character content appearing outside of the document body are consumed. If set to false, the end tags for the <body> and <html> elements are ignored, allowing content appearing outside of the document to be parsed and communicated to the application.
false
http://cyberneko.org/html/features/balance-tags/document-fragment
Specifies if the tag balancer should operate as if a fragment of HTML is being parsed. With this feature set, the tag balancer will not attempt to insert missing body elements around content and markup. However, proper parents for elements contained within the <body> element will still be inserted. This feature should not be used when using the DOMParser class. In order to parse a DOM DocumentFragment, use the DOMFragmentParser class.
false
http://cyberneko.org/html/features/scanner/normalize-attrs
Specifies whether attribute values should be normalized according to section 3.3.3 of the XML 1.0 specification. When set to false, only end-of-line normalization and expansion of entities are performed. When set to true, leading and trailing whitespace is trimmed and consecutive whitespace is normalized to a single space character. Note: The raw attribute values can be queried by turning on the augmentations feature and using XNI.

See: http://cyberneko.org/html/features/augmentations

false
http://cyberneko.org/html/features/scanner/cdata-sections
Specifies whether CDATA sections are reported as character content. If set to false, CDATA sections are reported as comments. When reported as comments, the comment text is prefixed with "[CDATA[" and ends with "]]". This prefix and suffix are not included when reported as character content.
false
http://apache.org/xml/features/scanner/notify-char-refs
Specifies whether character entity references (e.g. &#32;, &#x20;, etc) should be reported to the registered document handler. The name of the entity reported will contain the leading pound sign and optional 'x' character. For example, the name of the character entity reference &#x20; will be reported as "#x20".
false
http://apache.org/xml/features/scanner/notify-builtin-refs
Specifies whether the XML built-in entity references (e.g. &amp;, &lt;, etc) should be reported to the registered document handler. This only applies to the five pre-defined XML general entities -- specifically, "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.
false
http://cyberneko.org/html/features/scanner/notify-builtin-refs
Specifies whether the HTML built-in entity references (e.g. &nbsp;, &copy;, etc) should be reported to the registered document handler. This includes the five pre-defined XML general entities.
false
http://cyberneko.org/html/features/scanner/fix-mswindows-refs
Specifies whether to fix character entity references for Microsoft Windows® characters as described at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html.
false
http://cyberneko.org/html/features/scanner/ignore-specified-charset
Specifies whether to ignore the character encoding specified within the <meta http-equiv='Content-Type' content='text/html;charset=...'> tag or in the <?xml … encoding='…'> processing instruction. By default, NekoHTML checks META tags for a charset and changes the character encoding of the scanning reader object. Setting this feature to true allows the application to override this behavior.

See: http://cyberneko.org/html/properties/default-encoding

false
http://cyberneko.org/html/features/scanner/script/strip-comment-delims
Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <script> element content.

See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims
See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims

false
http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <script> element content.

See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims

false
http://cyberneko.org/html/features/scanner/style/strip-comment-delims
Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <style> element content.

See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims
See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims

false
http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <style> element content.

See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims

false
http://cyberneko.org/html/features/augmentations
Specifies whether infoset items that correspond to the HTML events are included in the parsing pipeline. If included, the augmented item will implement the HTMLEventInfo interface found in the org.cyberneko.html package. The augmentations can be queried in XNI by calling the getItem method with the key "http://cyberneko.org/html/features/augmentations". Currently, the HTML event info augmentation can report event character boundaries and whether the event is synthesized.
false
http://cyberneko.org/html/features/report-errors
Specifies whether errors should be reported to the registered error handler. Since HTML applications are supposed to permit the liberal use (and abuse) of HTML documents, errors should normally be handled silently. However, if the application wants to know about errors in the parsed HTML document, this feature can be set to true.
false
http://cyberneko.org/html/features/parse-noscript-content
Specifies whether the content of a <noscript>...</noscript> node should be parsed or not. When set to false the content will be considered as plain text whereas when set to true, tags will be parsed normally.
true
http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
Specifies whether a self closing <iframe/> tag should be allowed or not. When set to true the parser won't look for a corresponding closing </iframe> tag.
false
http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
Specifies whether a self closing tag (e.g. <div/>) should be allowed or not. When set to true the parser won't look for a corresponding closing tag.
false
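
Several of the scanner features above describe simple text transformations. The following sketch illustrates three of them as plain string functions: the whitespace handling of normalize-attrs, the entity name reported under notify-char-refs, and the comment text produced when cdata-sections is false. It is an illustration of the documented effects only, not NekoHTML's implementation; the class and method names are hypothetical, and whitespace is assumed to mean space, tab, carriage return and line feed.

```java
// Hypothetical illustration of effects described in the feature list;
// none of these methods exist in NekoHTML.
final class ScannerTextSketch {

    // normalize-attrs = true: runs of whitespace collapse to one space,
    // then a single leading or trailing space is removed.
    static String normalizeAttr(String raw) {
        return raw.replaceAll("[ \\t\\r\\n]+", " ").replaceAll("^ | $", "");
    }

    // notify-char-refs = true: the reported name keeps the pound sign and
    // the optional 'x', but drops the '&' and ';' delimiters.
    static String charRefName(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("not a character reference: " + reference);
        }
        return reference.substring(1, reference.length() - 1);
    }

    // cdata-sections = false: the section content is reported as a comment
    // whose text carries the "[CDATA[" prefix and "]]" suffix.
    static String cdataCommentText(String content) {
        return "[CDATA[" + content + "]]";
    }

    public static void main(String[] args) {
        System.out.println(normalizeAttr("  a \n\t b  "));
        System.out.println(charRefName("&#x20;"));
        System.out.println(cdataCommentText("x y"));
    }
}
```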

Properties

The NekoHTML parser supports the following properties:
Property Id / Description / Values / Default
http://cyberneko.org/html/properties/filters
This property allows applications to append custom document processing components to the end of the default NekoHTML parser pipeline. The value of this property must be an array of type org.apache.xerces.xni.parser.XMLDocumentFilter and no value of this array is allowed to be null. The document filters are appended to the parser pipeline in array order. Please refer to the filters documentation for more information.
Values: XMLDocumentFilter[]; Default: null
http://cyberneko.org/html/properties/default-encoding
Sets the default encoding the NekoHTML scanner should use when parsing documents. In the absence of an http-equiv directive in the source document, this setting is important because the parser does not have any support to auto-detect the encoding.

See: http://cyberneko.org/html/features/scanner/ignore-specified-charset

Values: IANA encoding names; Default: "Windows-1252"
http://cyberneko.org/html/properties/names/elems
Specifies how the NekoHTML components should modify recognized element names. Names can be converted to upper-case, converted to lower-case, or left as-is. The value of "match" specifies that element names are to be left as-is but the end tag name will be modified to match the start tag name. This is required to ensure that the parser generates a well-formed XML document.
Values: "upper", "lower", "match"; Default: "upper"
http://cyberneko.org/html/properties/names/attrs
Specifies how the NekoHTML components should modify attribute names of recognized elements. Names can be converted to upper-case, converted to lower-case, or left as-is.
Values: "upper", "lower", "no-change"; Default: "lower"
http://cyberneko.org/html/properties/doctype/pubid
Specifies the document type declaration public identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional public identifier, "-//W3C//DTD HTML 4.01 Transitional//EN".

See: http://cyberneko.org/html/features/override-doctype

Values: String; Default: HTML 4.01 transitional public identifier
http://cyberneko.org/html/properties/doctype/sysid
Specifies the document type declaration system identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional system identifier, "http://www.w3.org/TR/html4/loose.dtd".

See: http://cyberneko.org/html/features/override-doctype

Values: String; Default: HTML 4.01 transitional system identifier
http://cyberneko.org/html/properties/namespaces-uri
Specifies the namespace binding URI if the http://cyberneko.org/html/features/override-namespaces feature is set to true. The default value is the XHTML 1.0 namespace, "http://www.w3.org/1999/xhtml". This property does not affect the case of element and attribute names and does not ensure that the output of the NekoHTML parser is valid according to the XHTML specification.

See: http://cyberneko.org/html/features/override-namespaces

Values: String; Default: XHTML 1.0 namespace URI
Experimental http://cyberneko.org/html/properties/balance-tags/fragment-context-stack
Specifies the stack of elements that should be considered as ancestors while parsing an HTML fragment. For instance, when the last item of the context stack is a TABLE (or a TBODY, THEAD, or TFOOT), the following fragment will be parsed as a new row: <tr><td>hello</td></tr>. When the context doesn't indicate that the parser is already within a table, TR and TD tags will be discarded.
Values: QName[]; Default: null
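
The effect of the names/elems and names/attrs values described above can be sketched as a small naming policy. This is only an illustration of the documented behavior, assuming that "match" and "no-change" both leave a start tag or attribute name as written; the NameCasePolicy class and its methods are hypothetical and are not part of NekoHTML.

```java
import java.util.Locale;

// Hypothetical illustration of the "upper", "lower", "match" and "no-change"
// values; not NekoHTML code.
final class NameCasePolicy {

    // Name reported for a start tag (or attribute) under the given mode.
    static String startName(String raw, String mode) {
        switch (mode) {
            case "upper":
                return raw.toUpperCase(Locale.ROOT);
            case "lower":
                return raw.toLowerCase(Locale.ROOT);
            case "match":      // element names: left as written
            case "no-change":  // attribute names: left as written
                return raw;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    // Name reported for an end tag. Under "match" the end tag takes the
    // spelling of its start tag, so the pair stays well-formed as XML.
    static String endName(String rawStart, String rawEnd, String mode) {
        if ("match".equals(mode)) {
            return rawStart;
        }
        return startName(rawEnd, mode);
    }

    public static void main(String[] args) {
        System.out.println(startName("Div", "upper"));
        System.out.println(endName("Div", "DIV", "match"));
    }
}
```

Using Locale.ROOT keeps the case conversion independent of the default locale of the JVM, which matters for names containing the letter "i" under some locales.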