The application can set a variety of NekoHTML settings to more precisely control the behavior of the parser. These settings can be set directly on the HTMLConfiguration class or on the supplied parser classes by calling the setFeature and setProperty methods. For example:

    // settings on HTMLConfiguration
    org.apache.xerces.xni.parser.XMLParserConfiguration config =
        new org.cyberneko.html.HTMLConfiguration();
    config.setFeature("http://cyberneko.org/html/features/augmentations", true);
    config.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

    // settings on DOMParser
    org.cyberneko.html.parsers.DOMParser parser =
        new org.cyberneko.html.parsers.DOMParser();
    parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

Feature Id / Description | Default |
---|---|
http://xml.org/sax/features/namespaces
Specifies if the NekoHTML parser should perform namespace processing. If enabled, namespace binding attributes are processed and elements and attributes are bound to the defined namespaces. | true |
http://cyberneko.org/html/features/balance-tags
Specifies if the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes up many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. In order to process HTML documents as XML, this feature should not be turned off. The option to turn it off is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure. | true |
http://cyberneko.org/html/features/override-doctype
Specifies whether the NekoHTML parser should override the public and system identifier values specified in the document type declaration. See: http://cyberneko.org/html/properties/doctype/pubid | false |
http://cyberneko.org/html/features/insert-doctype
Specifies whether the NekoHTML parser should insert a document type declaration into the document handler callbacks. The values for the public and system identifiers are taken from the pubid and sysid properties, so those properties should be set if this feature is turned on. Setting this feature to true also causes the parser to ignore any document type declaration that appears in the document. See: http://cyberneko.org/html/properties/doctype/pubid | false |
http://cyberneko.org/html/features/override-namespaces
Specifies whether the NekoHTML parser should override the namespace URI bound to HTML elements and attributes. | false |
http://cyberneko.org/html/features/insert-namespaces
Specifies whether the NekoHTML parser should insert namespace URI bindings to HTML elements and attributes. The value for the namespace URI is taken from the http://cyberneko.org/html/properties/namespaces-uri property. Therefore, that property should be set if this feature is turned on. | false |
http://cyberneko.org/html/features/balance-tags/ignore-outside-content
Specifies if the NekoHTML parser should ignore content after the end of the document root element. If this feature is set to true, all elements and character content appearing outside of the document body are consumed. If set to false, the end tags for the <body> and <html> elements are ignored, allowing content appearing outside of the document to be parsed and communicated to the application. | false |
http://cyberneko.org/html/features/balance-tags/document-fragment
Specifies if the tag balancer should operate as if a fragment of HTML is being parsed. With this feature set, the tag balancer will not attempt to insert missing body elements around content and markup. However, proper parents for elements contained within the <body> element will still be inserted. This feature should not be used with the DOMParser class. In order to parse a DOM DocumentFragment, use the DOMFragmentParser class. | false |
http://cyberneko.org/html/features/scanner/normalize-attrs
Specifies whether attribute values should be normalized according to section 3.3.3 of the XML 1.0 specification. When set to false, only end-of-line normalization and expansion of entities are performed. When set to true, leading and trailing whitespace is trimmed and consecutive whitespace is normalized to a single space character. Note: The raw attribute values can be queried by turning on the augmentations feature and using XNI. | false |
http://cyberneko.org/html/features/scanner/cdata-sections
Specifies whether CDATA sections are reported as character content. If set to false, CDATA sections are reported as comments. When reported as comments, the comment text is prefixed with "[CDATA[" and ends with "]]". This prefix and suffix are not included when reported as character content. | false |
http://apache.org/xml/features/scanner/notify-char-refs
Specifies whether character entity references (e.g. &#32;, &#x20;, etc) should be reported to the registered document handler. The name of the entity reported will contain the leading pound sign and optional 'x' character. For example, the name of the character entity reference &#x20; will be reported as "#x20". | false |
http://apache.org/xml/features/scanner/notify-builtin-refs
Specifies whether the XML built-in entity references (e.g. &amp;, &lt;, etc) should be reported to the registered document handler. This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true. | false |
http://cyberneko.org/html/features/scanner/notify-builtin-refs
Specifies whether the HTML built-in entity references (e.g. &nobr;, &copy;, etc) should be reported to the registered document handler. This includes the five pre-defined XML general entities. | false |
http://cyberneko.org/html/features/scanner/fix-mswindows-refs
Specifies whether to fix character entity references for Microsoft Windows® characters as described at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. | false |
http://cyberneko.org/html/features/scanner/ignore-specified-charset
Specifies whether to ignore the character encoding specified within the <meta http-equiv='Content-Type' content='text/html;charset=...'> tag or in the <?xml … encoding='…'> processing instruction. By default, NekoHTML checks META tags for a charset and changes the character encoding of the scanning reader object. Setting this feature to true allows the application to override this behavior. | false |
http://cyberneko.org/html/features/scanner/script/strip-comment-delims
Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-comment-delims | false |
http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <script> element content. See: http://cyberneko.org/html/features/scanner/style/strip-cdata-delims | false |
http://cyberneko.org/html/features/scanner/style/strip-comment-delims
Specifies whether the scanner should strip HTML comment delimiters (i.e. "<!--" and "-->") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-comment-delims | false |
http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
Specifies whether the scanner should strip XHTML CDATA delimiters (i.e. "<![CDATA[" and "]]>") from <style> element content. See: http://cyberneko.org/html/features/scanner/script/strip-cdata-delims | false |
http://cyberneko.org/html/features/augmentations
Specifies whether infoset items that correspond to the HTML events are included in the parsing pipeline. If included, the augmented item will implement the HTMLEventInfo interface found in the org.cyberneko.html package. The augmentations can be queried in XNI by calling the getItem method with the key "http://cyberneko.org/html/features/augmentations". Currently, the HTML event info augmentation can report event character boundaries and whether the event is synthesized. | false |
http://cyberneko.org/html/features/report-errors
Specifies whether errors should be reported to the registered error handler. Since HTML applications are supposed to permit the liberal use (and abuse) of HTML documents, errors should normally be handled silently. However, if the application wants to know about errors in the parsed HTML document, this feature can be set to true. | false |
http://cyberneko.org/html/features/parse-noscript-content
Specifies whether the content of a <noscript>...</noscript> node should be parsed or not. When set to false, the content is treated as plain text; when set to true, tags are parsed normally. | true |
http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
Specifies whether a self-closing <iframe/> tag should be allowed or not. When set to true, the parser won't look for a corresponding closing </iframe> tag. | false |
http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
Specifies whether a self-closing tag (e.g. <div/>) should be allowed or not. When set to true, the parser won't look for a corresponding closing tag. | false |
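
As a rough illustration of what the normalize-attrs feature describes when set to true, the following Java sketch trims leading and trailing whitespace and collapses each run of whitespace into a single space. It is not NekoHTML's implementation: the class name AttrNormalizeSketch and the method normalize are hypothetical, and the sketch assumes that "whitespace" means space, tab, carriage return, and line feed.

```java
final class AttrNormalizeSketch {

    /**
     * Returns the value with leading and trailing whitespace removed and
     * every internal run of whitespace replaced by one space character.
     */
    static String normalize(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean pendingSpace = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // Remember one separator, but only after real text has started,
                // so leading whitespace is dropped.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A separator still pending here was trailing whitespace and is dropped.
        return out.toString();
    }

    public static void main(String[] args) {
        // prints [main nav bar]
        System.out.println("[" + normalize("  main   nav\tbar ") + "]");
    }
}
```

With the feature left at its default of false, the description above implies that a value such as "  main   nav\tbar " would keep its surrounding and repeated whitespace, apart from end-of-line normalization and entity expansion.
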
Property Id / Description | Values | Default |
---|---|---|
http://cyberneko.org/html/properties/filters
This property allows applications to append custom document processing components to the end of the default NekoHTML parser pipeline. The value of this property must be an array of org.apache.xerces.xni.parser.XMLDocumentFilter objects, and no element of the array may be null. The document filters are appended to the parser pipeline in array order. Please refer to the filters documentation for more information. | null | |
http://cyberneko.org/html/properties/default-encoding
Sets the default encoding the NekoHTML scanner should use when parsing documents. This setting matters when the source document has no http-equiv directive, because the parser does not have any support to auto-detect the encoding. See: http://cyberneko.org/html/features/scanner/ignore-specified-charset | IANA encoding names | |
http://cyberneko.org/html/properties/names/elems
Specifies how the NekoHTML components should modify recognized element names. Names can be converted to upper-case, converted to lower-case, or left as-is. The value of "match" specifies that element names are to be left as-is but the end tag name will be modified to match the start tag name. This is required to ensure that the parser generates a well-formed XML document. | "upper" "lower" "match" | "upper" |
http://cyberneko.org/html/properties/names/attrs
Specifies how the NekoHTML components should modify attribute names of recognized elements. Names can be converted to upper-case, converted to lower-case, or left as-is. | "upper" "lower" | "lower" |
http://cyberneko.org/html/properties/doctype/pubid
Specifies the document type declaration public identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional public identifier, "-//W3C//DTD HTML 4.01 Transitional//EN". | String | HTML 4.01 transitional public identifier |
http://cyberneko.org/html/properties/doctype/sysid
Specifies the document type declaration system identifier if the http://cyberneko.org/html/features/override-doctype feature is set to true. The default value is the HTML 4.01 transitional system identifier, "http://www.w3.org/TR/html4/loose.dtd". | String | HTML 4.01 transitional system identifier |
http://cyberneko.org/html/properties/namespaces-uri
Specifies the namespace binding URI if the http://cyberneko.org/html/features/override-namespaces feature is set to true. The default value is the XHTML 1.0 namespace, "http://www.w3.org/1999/xhtml". This property does not affect the case of element and attribute names and does not ensure that the output of the NekoHTML parser is valid according to the XHTML specification. | String | XHTML 1.0 namespace URI |
Experimental
http://cyberneko.org/html/properties/balance-tags/fragment-context-stack
Specifies the stack of elements that should be considered as ancestors while parsing an HTML fragment. For instance, when the last item of the context stack is a TABLE (or a TBODY, THEAD, or TFOOT), the following fragment will be parsed as a new row: <tr><td>hello</td></tr>. When the context doesn't indicate that the parser is already within a table, the TR and TD tags will be discarded. | QName[] | null |
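
The three values of the names/elems property can be pictured with a small Java sketch. This is an illustration under stated assumptions rather than the parser's own code: ElemNamesSketch, startTagName, and endTagName are hypothetical names, and the sketch assumes locale-independent case conversion.

```java
import java.util.Locale;

final class ElemNamesSketch {

    /** Name reported for a start tag under the given names/elems value. */
    static String startTagName(String mode, String name) {
        switch (mode) {
            case "upper":
                return name.toUpperCase(Locale.ROOT);
            case "lower":
                return name.toLowerCase(Locale.ROOT);
            case "match":
                return name; // left as written in the document
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    /**
     * Name reported for an end tag. With "match", the end tag takes the
     * spelling of its start tag so that the pair stays well-formed.
     */
    static String endTagName(String mode, String startName, String endName) {
        if (mode.equals("match")) {
            return startName;
        }
        return startTagName(mode, endName);
    }

    public static void main(String[] args) {
        // Start tag written as <Div>, end tag written as </DIV>.
        System.out.println(endTagName("match", "Div", "DIV")); // prints Div
        System.out.println(endTagName("lower", "Div", "DIV")); // prints div
    }
}
```

The "match" case shows why the description ties that value to well-formedness: XML names are case-sensitive, so <Div>...</DIV> would not be a matching pair unless the end tag name is rewritten.
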