Parser Settings

Configuring the Parser

The application can set a variety of NekoHTML settings to more precisely control the behavior of the parser. These settings can be set directly on the HTMLConfiguration class or on the supplied parser classes by calling the setFeature and setProperty methods. For example:

// settings on HTMLConfiguration
org.apache.xerces.xni.parser.XMLParserConfiguration config =
  new org.cyberneko.html.HTMLConfiguration();
config.setFeature("http://cyberneko.org/html/features/augmentations", true);
config.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

// settings on DOMParser
org.cyberneko.html.parsers.DOMParser parser = 
  new org.cyberneko.html.parsers.DOMParser();
parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
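
When the same settings have to be applied to several parser instances, it can help to collect them once and apply them in a loop. The following is a minimal sketch that uses only the standard SAX `XMLReader` interface from the JDK. `ParserSettings` and its methods are hypothetical names introduced for illustration and are not part of NekoHTML. The sketch assumes the target parser implements `org.xml.sax.XMLReader`, as `org.cyberneko.html.parsers.SAXParser` is expected to; for `DOMParser` or `HTMLConfiguration`, the `setFeature` and `setProperty` calls shown above would be made directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper (not part of NekoHTML): collects feature and property
// settings once and applies them to any SAX XMLReader.
final class ParserSettings {
    private final Map<String, Boolean> features = new LinkedHashMap<>();
    private final Map<String, Object> properties = new LinkedHashMap<>();

    ParserSettings feature(String id, boolean state) {
        features.put(id, state);
        return this;
    }

    ParserSettings property(String id, Object value) {
        properties.put(id, value);
        return this;
    }

    // Applies every setting in insertion order (features first, then
    // properties) and returns the ids the reader refused, instead of
    // failing on the first rejected setting.
    List<String> applyTo(XMLReader reader) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : features.entrySet()) {
            try {
                reader.setFeature(e.getKey(), e.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
                rejected.add(e.getKey());
            }
        }
        for (Map.Entry<String, Object> e : properties.entrySet()) {
            try {
                reader.setProperty(e.getKey(), e.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
                rejected.add(e.getKey());
            }
        }
        return rejected;
    }
}
```

With NekoHTML on the classpath, a call such as `new ParserSettings().feature("http://cyberneko.org/html/features/augmentations", true).applyTo(reader)` would return an empty list if the reader accepted the setting. Returning the rejected ids rather than throwing is a design choice of this sketch, so that one misspelled id does not hide the others.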

Features

The NekoHTML parser supports the following features:

Feature Id / Description	Default
http://xml.org/sax/features/namespaces Specifies if the NekoHTML parser should perform namespace processing. If enabled, namespace binding attributes are processed and elements and attributes are bound to the defined namespaces. See: http://cyberneko.org/html/features/override-namespaces	true
http://cyberneko.org/html/features/balance-tags Specifies if the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes up many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. In order to process HTML documents as XML, this feature should not be turned off. The ability to turn it off is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure.	true
http://cyberneko.org/html/features/override-doctype Specifies whether the NekoHTML parser should override the public and system identifier values specified in the document type declaration. See: http://cyberneko.org/html/properties/doctype/pubid See: http://cyberneko.org/html/properties/doctype/sysid	false
http://cyberneko.org/html/features/insert-doctype Specifies whether the NekoHTML parser should insert a document type declaration into the document handler callbacks. The values for the public and system identifiers are taken from the sysid and pubid properties. Therefore, those properties should be set if this feature is turned on. Also, setting this feature to `true` will cause the parser to ignore any document type declaration that appears in the document. See: http://cyberneko.org/html/properties/doctype/pubid See: http://cyberneko.org/html/properties/doctype/sysid	false
http://cyberneko.org/html/features/override-namespaces Specifies whether the NekoHTML parser should override the namespace URI bound to HTML elements and attributes. See: http://cyberneko.org/html/properties/namespaces-uri	false
http://cyberneko.org/html/features/insert-namespaces Specifies whether the NekoHTML parser should insert namespace URI bindings to HTML elements and attributes. The value for the namespace URI is taken from the namespaces-uri property. Therefore, that property should be set if this feature is turned on. See: http://cyberneko.org/html/properties/namespaces-uri	false
http://cyberneko.org/html/features/balance-tags/ignore-outside-content Specifies if the NekoHTML parser should ignore content after the end of the document root element. If this feature is set to true, all elements and character content appearing outside of the document body are consumed. If set to false, the end elements for the <body> and <html> are ignored, allowing content appearing outside of the document to be parsed and communicated to the application.	false
http://cyberneko.org/html/features/balance-tags/document-fragment Specifies if the tag balancer should operate as if a fragment of HTML is being parsed. With this feature set, the tag balancer will not attempt to insert missing body elements around content and markup. However, proper parents for elements contained within the <body> element will still be inserted. This feature should not be used when using the `DOMParser` class. In order to parse a DOM `DocumentFragment`, use the `DOMFragmentParser` class.	false
http://cyberneko.org/html/features/scanner/normalize-attrs Specifies whether attribute values should be normalized according to section 3.3.3 of the XML 1.0 specification. When set to `false`, only end-of-line normalization and expansion of entities are performed. When set to `true`, leading and trailing whitespace is trimmed and consecutive whitespace is normalized to a single space character. Note: The raw attribute values can be queried by turning on the augmentations feature and using XNI. See: http://cyberneko.org/html/features/augmentations	false
http://cyberneko.org/html/features/scanner/cdata-sections Specifies whether CDATA sections are reported as character content. If set to `false`, CDATA sections are reported as comments. When reported as comments, the comment text is prefixed with "[CDATA[" and ends with "]]". This prefix and suffix are not included when reported as character content.	false
http://apache.org/xml/features/scanner/notify-char-refs Specifies whether character entity references (e.g. &#32;, &#x20;, etc) should be reported to the registered document handler. The name of the entity reported will contain the leading pound sign and optional 'x' character. For example, the name of the character entity reference `&#x20;` will be reported as "#x20".	false
http://apache.org/xml/features/scanner/notify-builtin-refs Specifies whether the XML built-in entity references (e.g. &amp;, &lt;, etc) should be reported to the registered document handler. This only applies to the five pre-defined XML general entities -- specifically, "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the `http://cyberneko.org/html/features/scanner/notify-builtin-refs` feature to `true`.	false
http://cyberneko.org/html/features/scanner/notify-builtin-refs Specifies whether the HTML built-in entity references (e.g. &nbsp;, &copy;, etc) should be reported to the registered document handler. This includes the five pre-defined XML general entities.	false
http://cyberneko.org/html/features/scanner/fix-mswindows-refs Specifies whether to fix character entity references for Microsoft Windows® characters as described at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html.	false
http://cyberneko.org/html/features/scanner/ignore-specified-charset Specifies whether to ignore the character encoding specified within the <meta http-equiv='Content-Type' content='text/html;charset=...'> tag or in the <?xml … encoding='…'> processing instruction. By default, NekoHTML checks META tags for a charset and changes the character encoding of the scanning reader object. Setting this feature to `true` allows the application to override this behavior. See: http://cyberneko.org/html/properties/default-encoding	false
http://cyberneko.org/html/features/scanner/script/strip-comment-delims Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims	false
http://cyberneko.org/html/features/scanner/script/strip-cdata-delims Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims	false
http://cyberneko.org/html/features/scanner/style/strip-comment-delims Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims	false
http://cyberneko.org/html/features/scanner/style/strip-cdata-delims Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims	false
http://cyberneko.org/html/features/augmentations Specifies whether infoset items that correspond to the HTML events are included in the parsing pipeline. If included, the augmented item will implement the `HTMLEventInfo` interface found in the `org.cyberneko.html` package. The augmentations can be queried in XNI by calling the `getItem` method with the key "http://cyberneko.org/html/features/augmentations". Currently, the HTML event info augmentation can report event character boundaries and whether the event is synthesized.	false
http://cyberneko.org/html/features/report-errors Specifies whether errors should be reported to the registered error handler. Since HTML applications are supposed to permit the liberal use (and abuse) of HTML documents, errors should normally be handled silently. However, if the application wants to know about errors in the parsed HTML document, this feature can be set to `true`.	false
http://cyberneko.org/html/features/parse-noscript-content Specifies whether the content of a <noscript>...</noscript> node should be parsed or not. When set to `false` the content will be considered as plain text whereas when set to `true`, tags will be parsed normally.	true
http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe Specifies whether a self-closing <iframe/> tag should be allowed or not. When set to `true` the parser won't look for a corresponding closing </iframe> tag.	false
http://cyberneko.org/html/features/scanner/allow-selfclosing-tags Specifies whether a self-closing tag (e.g. <div/>) should be allowed or not. When set to `true` the parser won't look for a corresponding closing tag.	false
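
To make the effect of the `http://cyberneko.org/html/features/scanner/normalize-attrs` feature concrete, the whitespace rule described in the table (trim leading and trailing whitespace, collapse consecutive whitespace to a single space) can be sketched in plain Java. This is an illustration of the rule only, not NekoHTML's scanner code. `AttrNormalizer` is a hypothetical name, and treating exactly space, tab, carriage return, and line feed as whitespace is an assumption based on the XML 1.0 definition of white space.

```java
// Hypothetical illustration (not NekoHTML code) of the whitespace handling
// described for the normalize-attrs feature when it is set to true.
final class AttrNormalizer {

    // Assumption: whitespace means the four XML 1.0 white space characters.
    private static boolean isXmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Trims leading and trailing whitespace and collapses each internal
    // run of whitespace into a single space character.
    static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        boolean pendingSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (isXmlWhitespace(c)) {
                // Remember a separator only once some text has been emitted,
                // which drops leading whitespace.
                pendingSpace = out.length() > 0;
            } else {
                // Emit the remembered separator just before the next text
                // character, which drops trailing whitespace.
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, a `class` attribute written as `"  nav \t main\n"` would become `"nav main"` under this rule, whereas with the feature left at its default of `false` the internal spacing would be kept apart from end-of-line normalization.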

Properties

The NekoHTML parser supports the following properties:

Property Id / Description	Values	Default
http://cyberneko.org/html/properties/filters This property allows applications to append custom document processing components to the end of the default NekoHTML parser pipeline. The value of this property must be an array of type `org.apache.xerces.xni.parser.XMLDocumentFilter` and no value of this array is allowed to be null. The document filters are appended to the parser pipeline in array order. Please refer to the filters documentation for more information.	XMLDocumentFilter[]	null
http://cyberneko.org/html/properties/default-encoding Sets the default encoding the NekoHTML scanner should use when parsing documents. In the absence of an `http-equiv` directive in the source document, this setting is important because the parser does not have any support to auto-detect the encoding. See: http://cyberneko.org/html/features/scanner/ignore-specified-charset	IANA encoding names	"Windows-1252"
http://cyberneko.org/html/properties/names/elems Specifies how the NekoHTML components should modify recognized element names. Names can be converted to upper-case, converted to lower-case, or left as-is. The value of "match" specifies that element names are to be left as-is but the end tag name will be modified to match the start tag name. This is required to ensure that the parser generates a well-formed XML document.	"upper" "lower" "match"	"upper"
http://cyberneko.org/html/properties/names/attrs Specifies how the NekoHTML components should modify attribute names of recognized elements. Names can be converted to upper-case, converted to lower-case, or left as-is.	"upper" "lower" "no-change"	"lower"
http://cyberneko.org/html/properties/doctype/pubid Specifies the document type declaration public identifier if the `http://cyberneko.org/html/features/override-doctype` feature is set to `true`. The default value is the HTML 4.01 transitional public identifier, "-//W3C//DTD HTML 4.01 Transitional//EN". See: http://cyberneko.org/html/features/override-doctype	String	HTML 4.01 transitional public identifier
http://cyberneko.org/html/properties/doctype/sysid Specifies the document type declaration system identifier if the `http://cyberneko.org/html/features/override-doctype` feature is set to `true`. The default value is the HTML 4.01 transitional system identifier, "http://www.w3.org/TR/html4/loose.dtd". See: http://cyberneko.org/html/features/override-doctype	String	HTML 4.01 transitional system identifier
http://cyberneko.org/html/properties/namespaces-uri Specifies the namespace binding URI if the `http://cyberneko.org/html/features/override-namespaces` feature is set to `true`. The default value is the XHTML 1.0 namespace, "http://www.w3.org/1999/xhtml". This property does not affect the case of element and attribute names and does not ensure that the output of the NekoHTML parser is valid according to the XHTML specification. See: http://cyberneko.org/html/features/override-namespaces	String	XHTML 1.0 namespace URI
Experimental http://cyberneko.org/html/properties/balance-tags/fragment-context-stack Specifies the stack of elements that should be considered as ancestors while parsing an HTML fragment. For instance, when the last item of the context stack is a TABLE (or a TBODY, THEAD, TFOOT), the following fragment will be parsed as a new row: <tr><td>hello</td></tr>. When the context doesn't indicate that we are already within a table, TR and TD tags will be discarded.	QName[]	null
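
The name-handling values of the `names/elems` and `names/attrs` properties can be illustrated with a small sketch. It shows only the conversions described in the table and is not taken from NekoHTML. `NameCase` and its methods are hypothetical names, and the sketch assumes the caller has already paired each end tag with its start tag, since the text above does not say how that pairing is done.

```java
import java.util.Locale;

// Hypothetical illustration (not NekoHTML code) of the "upper", "lower",
// "match", and "no-change" values described for the names properties.
final class NameCase {

    // Converts a name for "upper" or "lower"; any other mode
    // ("match", "no-change") leaves the name as written.
    static String modify(String name, String mode) {
        if ("upper".equals(mode)) {
            return name.toUpperCase(Locale.ROOT);
        }
        if ("lower".equals(mode)) {
            return name.toLowerCase(Locale.ROOT);
        }
        return name;
    }

    // Name to report for an end tag, given the name already reported for
    // its start tag. In "match" mode the end tag takes the start tag's
    // spelling so that the two agree, as well-formed XML requires.
    static String endTagName(String reportedStartName, String rawEndName, String mode) {
        if ("match".equals(mode) && reportedStartName.equalsIgnoreCase(rawEndName)) {
            return reportedStartName;
        }
        return modify(rawEndName, mode);
    }
}
```

Under this sketch, a document containing `<Div>...</DIV>` would report the names `DIV` and `DIV` with "upper", `div` and `div` with "lower", and `Div` and `Div` with "match".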