HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:
HTML traversal: offering programmers an interface to access and modify the HTML source. Canonical example: DOM parsers.
HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
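As a rough illustration of the traversal purpose, the sketch below walks a piece of markup and collects link targets. It uses Python's standard-library html.parser module, which is not one of the parsers compared in this article and is event-based rather than DOM-based. The names LinkCollector and extract_links and the sample markup are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> elements while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook;
        # attrs is a list of (name, value) pairs, value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Return the href of every <a> element found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = '<p>See <a href="https://example.org/a">A</a> and <A HREF="/b">B</A>.</p>'
print(extract_links(sample))
# prints ['https://example.org/a', '/b']
```

A DOM-style parser such as those in the table below would instead build a tree that can be queried and modified after parsing; the event-based approach here only shows the access half of traversal.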
| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| HTML Tidy | W3C license | ANSI C | 2021-07-17[2] | Yes[3] | Yes | Yes[3] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2023-10-31[4] | Yes | ? | No | No |
| Beautiful Soup | MIT License | Python | 2023-04-07[5] | Yes | Yes | ? | No |
| jsoup | MIT License | Java | 2023-12-29[6] | Yes | Yes | Yes | Yes |
* Date of the latest release with significant changes.
** Sanitize (generate standards-compliant web pages, reduce spam, etc.) and clean (strip surplus presentational tags, remove XSS code, etc.) HTML code.
*** Update HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
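The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows. This is a minimal illustration built on Python's standard-library html.parser, not the method used by HTML Tidy, jsoup, or any other parser in the table, and html.parser itself does not implement the HTML5 parsing algorithm. The names CenterRewriter and update_center_tags are hypothetical. The sketch re-emits every tag in lowercase and does not repair invalid markup.

```python
from html import escape
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emits markup, replacing <center> with a centred <div> (hypothetical helper)."""

    def __init__(self):
        # Leave character references unconverted in text so they are echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _format_attrs(attrs):
        # Attribute values arrive unescaped, so escape them again on output.
        parts = []
        for name, value in attrs:
            if value is None:
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    @staticmethod
    def _with_centering(attrs):
        # Keep other attributes and put the centring rule in front of any existing style.
        kept = [(n, v) for n, v in attrs if n != "style"]
        old_style = "".join(v or "" for n, v in attrs if n == "style")
        return kept + [("style", "text-align:center;" + old_style)]

    def _open(self, tag, attrs, closer):
        if tag == "center":
            tag, attrs = "div", self._with_centering(attrs)
        self.out.append(f"<{tag}{self._format_attrs(attrs)}{closer}")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center_tags(markup):
    """Return markup with every <center> element rewritten as a styled <div>."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)


print(update_center_tags("<CENTER>Hello</CENTER>"))
# prints <div style="text-align:center;">Hello</div>
```

A production tool would typically work on a parsed tree and handle many more deprecated elements; the streaming rewrite here only shows the idea of mapping one deprecated tag to a valid equivalent.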
[1] "12.2 Parsing HTML documents", HTML Standard. Archived 2013-01-16 at the Wayback Machine.
[2] HTML Tidy release 5.8.0
[3] What is Tidy?
[4] HtmlUnit 3.7.0
[5] Beautiful Soup release 4.10
[6] jsoup Java HTML Parser release 1.17.2