Change History
Releases
- Version 1.9.21 (2 Jun 2014)
- Ensure that closing an unknown element closes only the matching unknown element rather than any unknown element,
added definition for the HTML5 tag SECTION.
- Version 1.9.20 (13 Feb 2014)
- Fix IllegalArgumentException occurring with entities having invalid UTF-16 code and use replacement character (�) instead,
fix ArrayIndexOutOfBoundsException occurring when stream ends with a \r in attribute's value (#154).
- Version 1.9.19 (9 Oct 2013)
- Element LI closes DIV,
handle Unicode supplementary characters (#3609978, patch from Dan Rabe),
change LABEL to inline element (#152),
fixed resource leak (#151).
- Version 1.9.18 (27 Feb 2013)
- Elements ADDRESS, CENTER, DD, DIR, DL, DT, FIELDSET, LISTING, LI, MENU, OL, PRE, UL, and XMP close P (#3595486, patch from Ahmed Ashour),
TR closes DIV,
accept more characters in attribute names (like "?" or ";").
- Version 1.9.17 (5 Nov 2012)
- Don't rely on default locale for lowercase/uppercase conversion of tag and attribute names (#3544334, based on patch provided by Ronald Brill),
accept only FRAME, FRAMESET, and NOFRAMES within FRAMESET (fix StackOverflowError #3555034),
add HEAD before FRAMESET when missing,
recognize encoding specified in META charset='...',
fix StackOverflowError occurring with content after closing BODY tag with feature http://cyberneko.org/html/features/balance-tags/ignore-outside-content set to true (#3490807),
TABLE doesn't close inline elements anymore (#3527659, reverting fix for #2019307).
- Version 1.9.16 (18 Jul 2012)
- Add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-tags to allow self-closing tags (e.g. <DIV/>) (default is false) (#3519597, patch provided by Ronald Brill),
allow block elements within BUTTON (#3524099, patch provided by Ronald Brill).
- Version 1.9.15 (3 Aug 2011)
- Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745),
change INS to inline element,
change BUTTON to inline element,
don't parse body of IFRAME,
add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false),
make detected encoding available as Locator2.getEncoding() (#3381270).
- Version 1.9.14 (2 Feb 2010)
- Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697),
TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796),
trim encoding found in meta tag (#2904817),
fix ArrayIndexOutOfBoundsException on empty attribute when using feature normalize-attrs (#2838901),
recognize tags even if the > of the opening tag is missing (#2886227),
only a closing TABLE tag can close a table (#2913095),
fix StackOverflowError when parsing document fragment (#2911449),
fix NullPointerException occurring with the insert-namespaces feature (#2942363).
- Version 1.9.13 (2 Sept 2009)
- An A tag should close an open A tag (fixes OutOfMemoryError reported in #2780607),
not all entities with a missing semicolon should be recognized as entities (browsers behave differently, so until we have something like "browser profiles", NekoHTML will only recognize incomplete entities with code less than 256),
automatically add TBODY around TR nested directly within TABLE,
TD and TH are containers and should force closing of nested tags like DIV,
don't generate incorrect warnings about missing closing BODY and HTML tags when feature http://cyberneko.org/html/features/balance-tags/ignore-outside-content is off,
don't generate opening TABLE or SELECT tag when it's missing (therefore discard child nodes),
added experimental tag balancer property http://cyberneko.org/html/properties/balance-tags/fragment-context-stack allowing one to specify the current tag stack when evaluating a new input source during parsing,
ignore whitespace that is a direct child of HTML,
fix org.cyberneko.html.filters.Writer wrongly changing the attribute value of some META tags (#2815779),
use the http://www.w3.org/1999/xhtml namespace for automatically inserted tags (#2799585),
DOMFragmentParser ignores invalid processing instructions and attribute names (#2828553, #2828534),
define latest Xerces version (2.9.1) as dependency in the published Maven POM,
end of parent element forces closing of unknown elements (#2816091),
handle new lines directly in HTMLScanner.scanCharacters to avoid multiple calls (patch provided by Sean Bridges, #2829319),
fix ArrayIndexOutOfBoundsException occurring when using the strip-comment-delims feature (#2837555).
- Version 1.9.12 (20 Apr 2009)
- fixed NPE when parsing from a Character stream (patch provided by Ludger Bünger, #2503982),
when closing comment --> is missing, comment ends with > (patch provided by Tatsuhiko Miyabe, #2552096),
don't treat tags with non-HTML namespace like HTML tags (patch provided by Tatsuhiko Miyabe, #2551958),
force creation of BODY rather than of HEAD for unknown tags,
fix incorrect HTMLEventInfo augmentations for script content (patch provided by Louis Ryan, #2236681),
add HEAD and BODY tags when missing (#1898038),
ignore encoding specified in XML declaration if feature http://cyberneko.org/html/features/scanner/ignore-specified-charset is set to true (#2529933),
hold text found before HTML to feed it first once BODY is opened,
when strip-cdata-delims and strip-comment-delims are on, CDATA delimiters should be trimmed even if CDATA is located within a comment (#2671917),
fixed infinite loop while searching end comment signs (patch provided by H. J. Hill, #2671480),
features http://cyberneko.org/html/features/scanner/style/strip-comment-delims and http://cyberneko.org/html/features/scanner/style/strip-cdata-delims should not strip the whole line after the opening delimiter (#2767233).
- Version 1.9.11 (29 Dec 2008)
- fixed regression introduced in 1.9.10 with LI incorrectly closing too many parent nodes,
search for element to close on stack should stop on block node, not on container node.
- Version 1.9.10 (22 Dec 2008)
- SELECT wrongly closes (and reopens) inline tags (patch provided by Ahmed Ashour, #2146829),
add character offsets in HTMLEventInfo augmentations (patch provided by Ian Roberts, #2128228),
use encoding provided in XML declaration to decode the stream (#1918243),
recognize entities even if trailing semicolon is missing (patch provided by Tatsuhiko Miyabe, #1981792),
handle P like SPAN and allow it to be contained in A, B, I, U, ... (fixes #2421765),
ignore xml declaration attributes for which no value could be scanned (fixes #2421775).
- Version 1.9.9 (11 Sept 2008)
- Fixed bugs #2059466 and #2051091 (accepting unknown tags within inline elements as well as containers, don't accept any container in head),
#2039483 (wrong augmentation when attribute value contains a newline, patch from Ian Roberts),
#2039915 (failed skip() does not back up columnNumber, patch from Ian Roberts),
added new feature http://cyberneko.org/html/features/parse-noscript-content to turn off <noscript> content parsing,
#2094515 (FONT tag should be treated like inline tags and reopened when its closing was forced by an inline element, to respect tag balancing).
- Version 1.9.8 (22 Jul 2008)
- Fixed bugs #1949460 (handling of uppercase 'X' for entities in hexadecimal format),
#1951703 (isEncodingCompatible throws UnsupportedOperationException when default encoding is JISAutoDetect, patch from Tatsuhiko Miyabe),
#1965055 (java.lang.OutOfMemoryError: Java heap space with quotes in JavaScript comments),
#1990307 (evaluateInputSource parses only partially, patch from kkolman),
ignore self closing tags for tags that accept a body (generalisation of what was done in 1.9.7 with the FORM tag),
unknown tags can't be containers,
improved parsing of script body (comments and tags),
#2019307 (table closes inline elements),
#1903899 (ErrorHandlerWrapper missing in xercesMinimal.jar).
- Version 1.9.7 (5 Apr 2008)
- Call methods that differ between Xerces versions using a bridge per Xerces version rather than reflection,
ignore closing instruction in <form .../>,
fixed issues 1922810, 1927966, 1926831 (thanks to Ahmed Ashour).
- Version 1.9.6.2 (17 Mar 2008)
- Accept comments and tags in <title> element as browsers do (patch provided by Ahmed Ashour, issue 1879598),
discard form opening tag when it is within another form,
added experimental possibility to immediately evaluate a new input source during parsing (patch provided by Ahmed Ashour, issue 1895946),
added experimental interface to enable handlers to be notified of ignored opening and closing tags,
finally fixed compatibility with Java 1.3.
- Version 1.9.6.1 (23 Jan 2008)
- Fixed charset regression reported by Marc Guillemot.
- Version 1.9.6 (14 Dec 2007)
- Changed license to Apache 2.0;
boosted the version number to reflect the maturity of the project;
re-organized project files to decouple it from the rest of the
CyberNeko Tools for XNI;
updated xercesMinimal.jar and source so that NekoHTML compiles using
Xerces-J 2.9.1;
changed default behavior to not normalize attribute values
and added new feature to allow user to turn on normalization;
modified build to target compilation for Java 1.3 as suggested by
Jacob Kjome;
adjusted <p> tag-balancing suggested by Jacob Kjome;
and
fixed issues 1723287, 1746732, and 1790414.
- Version 0.9.5 (18 Jun 2005)
- Added feature submitted by Asgeir Asgeirsson to allow scanner to fix
character entity references for Microsoft Windows® characters;
stopped building nekohtmlXni.jar file by default;
fixed handling of <blockquote> reported by Joseph Walton
to better match browser behavior;
fixed tag-balancing bug for unknown elements reported by Marc
Guillemot and Vadim Tashlikovich;
fixed mapping of encoding name in <meta> element reported by Marc Guillemot;
changed tag-balancing to allow headers inside of links suggested
by Laurens Fridael;
applied attribute namespace patch from Joseph Walton;
fixed namespace bug for "xml" prefixes reported by Asgeir
Asgeirsson;
fixed namespace bug for "xmlns" prefixes reported by
Johannes Koch;
and
fixed no-such-method exception bug when using augmentations feature
with older versions of Xerces2 reported by Hans Donner.
- Version 0.9.4 (17 Nov 2004)
- Fixed typo in proviso 5 of the license agreement;
added features to strip CDATA delimiters (i.e. "<![CDATA[" and
"]]>") from <script> and <style> elements suggested by Dan Sojka;
fixed tag-balancing problem reported by Egor Samarkhanov;
applied augmentations patches donated by Marc-André Morissette;
implemented augmentation performance enhancements inspired by
Marc-André Morissette;
fixed ignore-outside-content bug reported by Chris Erskine;
and
updated link to Xerces download site.
- Version 0.9.3 (30 Jun 2004)
- Implemented scanning of XML declaration;
fixed <script> tag scanning bug reported by Vasiliev Ivan;
added Version class and manifest entries to query product information;
and fixed some Javadoc errors.
- Version 0.9.2 (31 Mar 2004)
- Fixed entity reference scanning and tag-balancing bugs identified
by Tommy Sandstrom;
fixed tag-balancing bug reported by Oliver Pfeiffer;
fixed doctype scanning bug reported by Jonathan Baxter;
updated Purifier filter to synthesize missing namespace bindings;
updated Writer filter to convert all known characters back to
their entity names;
and
updated implementation to work with Xerces-J 2.6.2, which removed
the ObjectFactory class from the org.apache.xerces.util package.
- Version 0.9.1 (24 Feb 2004)
- Fixed namespace binding bug reported by Jonathan Baxter.
- Version 0.9 (19 Feb 2004)
- Implemented scanning of CDATA sections;
implemented namespace processing;
added features to
override namespace bindings,
insert namespace bindings if not present,
override doctype public and system identifiers, and
insert doctype declaration if not present;
added a filter to allow applications to "purify" the input, ensuring
that the output is well-formed XML;
added missing location augmentations from document type declaration
callback;
fixed newline scanning bugs reported by Jonathan Baxter;
and
fixed comment scanning bugs and infinite loop bug caused by extremely
long element and attribute names found by Ram Subbaroyan.
- Version 0.8.3 (12 Dec 2003)
- Fixed null pointer exception for <frameset> tags reported by
Dawid Weiss;
and
added missing file to xercesMinimal.jar file reported by Brent Beardsley.
- Version 0.8.2 (14 Nov 2003)
- Fixed array index out of bounds exception in special tags and
doctype scanning bug reported by Leo Galambos;
updated processing instruction scanning to handle weird PIs exported
from Microsoft products as reported by Gabriele Bulfon;
fixed erroneous reporting of missing whitespace before attributes
reported by Arno Schatz;
installed a default error handler that prints to standard error
suggested by Arno Schatz;
and
fixed handling of dangling </p> reported by Gopi Murthy to
better match browser behavior.
- Version 0.8.1 (30 Sep 2003)
- Fixed bug reported by Yuan Ji that allowed multiple <html> tags;
fixed bug in stripping leading comments in <script> tags
as reported by Lawrence McCartin;
added feature to be able to strip HTML comment delimiters (i.e. "<!--"
and "-->") from <style> element content suggested by
Lawrence McCartin;
updated DOMParser to work around a bug in the Xerces HTML DOM
implementation when a doctype node was inserted into the document,
reported by Troy Waldrep;
updated the DOMFragmentParser to allow setting of features and
properties as requested by Paul Reeves;
changed the status of the document fragment parser from experimental
to supported;
added feature to allow application to ignore a character encoding
specified in a <meta http-equiv='Content-Type'
content='text/html;charset=...'> tag requested by Roger Fullerton;
and
changed feature identifier for document fragment tag balancing to
be more in line with other features (but retained old feature
identifier for backwards compatibility).
- Version 0.8 (05 Aug 2003)
- Implemented scanning of doctype declaration;
implemented non-normalized attribute value for XNI filters that want
to know original attribute value;
fixed bug scanning entity references inside of unquoted attributes;
fixed line counting bug in attribute values reported by Arno Schatz;
and
updated files in xercesMinimal.jar noted by Brent Beardsley.
- Version 0.7.7 (25 Jun 2003)
- Fixed handling of <font> tags reported by Dave King;
fixed bugs that caused multiple <head> and <body> tags
as reported by Mike Bowler;
fixed missing <tr> bug in nested tables reported by Troy Waldrep;
and
normalized newlines in attribute values to spaces.
- Version 0.7.6 (06 May 2003)
- Fixed infinite loop in special tags reported by Mike Bowler.
- Version 0.7.5 (02 May 2003)
- Fixed parsing of entity reference within <textarea> tags reported
by Mattias Jiderhamn;
changed behavior of tag balancer to not consume content after the end
<body> and <html> tags but retained old behavior through
new feature;
fixed <noscript> bug reported by Takashi Tomokiyo;
and
updated implementation for XNI changes introduced in Xerces-J 2.4.0.
- Version 0.7.4 (03 Mar 2003)
- Fixed <form> element balancing problem reported by Dan Rocco;
fixed null pointer exception reported by Michael Dynin that was
caused by a null XMLResourceIdentifier object passed to the
startGeneralEntity method in the Xerces DOM parser classes;
fixed handling of <font> element as requested by Arno Schatz
to better match current browsers;
replaced generic catch exception blocks with explicit catch blocks
suggested by Arno Schatz;
fixed <center> tag-balancing problem reported by Russell Gold;
fixed null pointer exception caused by null namespace context
object passed to Xerces SAX parser class reported by David Leslie;
and
added FAQ entry describing how to insert custom filters before
the tag-balancer.
- Version 0.7.3 (28 Jan 2003)
- Updated implementation for XNI changes introduced in Xerces-J 2.3.0;
and
fixed hack string to accommodate XML4J build of Xerces included in
the Eclipse editor reported by Geoffrey Longman.
- Version 0.7.2 (10 Jan 2003)
- Fixed class-cast exception bug in DOMFragmentParser reported by
Joseph Artsimovich;
fixed <span> tag-balancing bug reported by Ron Cemer;
and
fixed handling of form tags missing a parent element reported by
Russell Gold in order to better match browser behavior.
- Version 0.7.1 (06 Dec 2002)
- Fixed null pointer exception caused by null attributes object
passed to Xerces SAX parser class as reported by Kevin Huber;
and
fixed infinite loop condition when encountering "</html[eof]"
as reported by Matt Hurst.
- Version 0.7 (27 Nov 2002)
- Changed behavior of tag balancer for unbalanced elements
as requested by Troy Waldrep to make output match that
produced by browsers such as Mozilla;
fixed other tag balancing problems identified by a bug
reported by Laurens Fridael;
added experimental HTML fragment
parsing feature and DOM fragment parser class;
fixed buffer boundary bug in skipMarkup method reported
by Mike Bowler;
added constructor to the Writer filter that accepts a
Java writer object parameter as requested by Alain
Gilbert;
fixed HTMLScanner class so that it can compile with JDK 1.1
as reported by Mikko Honkala;
and
fixed bug reported by Russell Gold that would ignore
the <param> element within an <applet>
element.
- Version 0.6.8 (30 Sep 2002)
- Implemented scanning of processing instructions;
improved performance of HTMLElements#getElement method inspired
by Sam Cheung;
changed tag balancer algorithm as requested by Mike Bowler so
that it does not close the <body> element to insert a
proper parent element;
fixed <isindex> proper parent bug and <script> empty
element tag bug reported by Mike Bowler;
fixed bug reported by YingLCS that a <form> tag
would prematurely close a <p> tag;
and
updated implementation for XNI changes introduced in Xerces-J 2.2.0.
- Version 0.6.7 (06 Sep 2002)
- Added a FAQ section;
and
updated implementation for XNI changes introduced in Xerces-J 2.1.0.
- Version 0.6.6 (25 Aug 2002)
- Changed packaging to include product name and version in
directory name;
updated HTMLConfiguration to implement the
XMLPullParserConfiguration interface;
fixed bug reported by Martin Jericho to correct handling
of <col> element;
fixed bug reported by Dave King that would skip to end of
document if bad markup was found;
fixed numerous bugs related to scanning <script> tags
reported by Sam Cheung;
added feature to be able to strip HTML comment delimiters (i.e.
"<!--" and "-->") from <script> element content;
changed the status of the feature to dynamically insert
content from experimental to supported;
added code to be able to compare test files against canonical
output for regression testing;
and
fixed minor bugs found by the tests.
- Version 0.6.5 (17 Jul 2002)
- Fixed bug in changing character encoding when "charset=..." is
not written in lowercase;
and
mark attributes as "specified".
- Version 0.6.4 (15 Jun 2002)
- Re-organized package contents for integration into the CyberNeko
Tools for XNI package;
fixed table closing bug reported by Oskar Liljeblad;
fixed newline bug reported by OtisG;
and
fixed line counting bug reported by Donald Ball.
- Version 0.6.3 (29 May 2002)
- Fixed bug in handling of <th> elements reported by
Oskar Liljeblad;
and
fixed various tag-balancing problems.
- Version 0.6.2 (26 May 2002)
- Changed scanner behavior as requested by Alexey Shananin to
report malformed start elements (e.g. <...>) as characters;
and
fixed tag balancing bug introduced in previous version. Oops!
- Version 0.6.1 (23 May 2002)
- Changed tag balancer behavior to swallow events after the close
of the <html> tag to ensure that the document stream
remains well-formed;
added additional Ruby elements;
and
improved tag balancer performance.
- Version 0.6 (12 May 2002)
- Added property to allow custom document filters to be appended
to the default NekoHTML parser pipeline;
added convenience filters for serializing HTML documents and
removing elements from the document event stream;
added samples to demonstrate the filtering feature;
added experimental functionality to
allow applications to dynamically insert content into the
HTML document stream;
added a minimal Xerces2 Jar file containing just the files
required for using the HTMLConfiguration class directly to
alleviate full dependence on Xerces2 distribution;
applied patch from Serge Proskuryakov to fix handling of
misplaced <title> within <body>;
fixed minor tag balancing bug;
and
re-organized and added new documentation.
- Version 0.5 (07 May 2002)
- Fixed some location reporting information bugs and added
feature to report character boundaries of events via the
associated augmentations object;
added feature to disable tag balancing;
and
added features to notify handlers of start and end of character
and built-in XML and HTML entity references.
- Version 0.4.1 (03 May 2002)
- Fixed some unquoted attribute value scanning bugs reported
by Xiaowei Jiang;
fixed hack for Xerces-J 2.0.1 reported by Ron Cemer;
now passing locator object to startDocument method;
and
celebrated opening of the Spider-Man movie.
- Version 0.4 (14 Apr 2002)
- Added properties to control case of element and attribute names;
changed behavior of parser so that only known HTML elements
have their names modified according to the properties — all
unknown tags are left as-is;
added property to set default encoding;
added feature to augment infoset to report "synthesized" events;
added feature to be able to report errors and localized the error
messages;
implemented the locator so that location information can be
reported;
and
fixed element information so that more elements are properly
scanned as "special".
- Version 0.3.3 (02 Apr 2002)
- Moved META-INF/services/* files into a separate Jar
so that HTML parser configuration selection can be controlled
more explicitly; added DOM and SAX parser classes for
convenience; and fixed bug so that parser now obeys the
encoding specified in the input source.
- Version 0.3.2 (15 Mar 2002)
- Fixed problem with bare <input> elements appearing outside
of <form> tag.
- Version 0.3.1 (07 Mar 2002)
- Fixed handling of bare ampersands in content and attribute
values.
- Version 0.3 (25 Feb 2002)
- Changed license to an Apache style license and fixed a
few bugs.
- Version 0.2.3 (19 Feb 2002)
- Nested tables bug fix.
- Version 0.2.2 (17 Feb 2002)
- More bug fixes to allow the parser to be used with Xalan
2.3.0. The parser wasn't keeping track of features and
properties and without namespaces turned on, Xalan would
not correctly transform the SAX events emitted using
NekoHTML.
- Version 0.2.1 (16 Feb 2002)
- Minor bug fix to work around problem in Xerces-J 2.0.0 SAX
parser that drops attributes when parser configuration
doesn't have a symbol table.
- Version 0.2 (14 Feb 2002)
- Added support for UTF-8, UTF-16, and other 8-bit encodings
supported by Java.
- Version 0.1 (04 Feb 2002)
- Initial writing.
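
Note on the 1.9.17 entry (locale-independent case conversion of tag and attribute names): the sketch below illustrates the kind of problem such a fix addresses. It uses only the Java standard library. The class TagNameCase and its methods are hypothetical names chosen for illustration and are not NekoHTML code, and the use of Locale.ROOT is an assumption, since the changelog does not say which fixed locale the parser uses.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of NekoHTML.
final class TagNameCase {

    // Lowercase a tag or attribute name without consulting the JVM default locale.
    // Locale.ROOT is an assumed choice; any fixed, language-neutral locale would do.
    static String lower(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Uppercase counterpart, again independent of the default locale.
    static String upper(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under Turkish casing rules, 'I' lowercases to dotless 'ı' (U+0131),
        // so the result no longer matches the expected name "title".
        String localeSensitive = "TITLE".toLowerCase(turkish);
        System.out.println(localeSensitive.equals("title"));

        // The locale-independent conversion is stable on every JVM.
        System.out.println(lower("TITLE").equals("title"));
    }
}
```

A parser that calls toLowerCase() or toUpperCase() without a locale argument inherits whatever default the host JVM was started with, which is why pinning the locale makes element lookups behave the same everywhere.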
(C) Copyright 2002-2009, Andy Clark, Marc Guillemot. All rights reserved.