HTML Tidy for HTML5 (experimental)

This page documents the experimental HTML5 fork of HTML Tidy available at https://github.com/w3c/tidy-html5.

File bug reports and enhancement requests at https://github.com/w3c/tidy-html5/issues.

The W3C public mailing list for HTML Tidy discussion is html-tidy@w3.org (list archive).

For more information on HTML5:

Validate your HTML documents using the W3C Nu Markup Validator.

What Tidy does

Tidy corrects and cleans up HTML content by fixing markup errors. Here are a few examples:

How to run Tidy from the command line

This is the syntax for invoking Tidy from the command line:

   tidy [[options] filename]*

Tidy reads from standard input by default, so if you run Tidy without a filename argument, it waits for input to read. Tidy also writes to standard output by default, so you can pipe output from Tidy to other programs, as well as pipe output from other programs to Tidy. For example, you can page through the output from Tidy by piping it to a pager:

   tidy file.html | less

To have Tidy write its output to a file instead, either use the -o filename or -output filename option, or redirect standard output to the file; for example:

   tidy -o output.html index.html
   tidy index.html > output.html

Both commands run tidy on the file index.html and write the output to the file output.html, while writing any error messages to standard error.

Tidy defaults to writing its error messages to standard error (that is, to the console where you’re running Tidy). To page through the error messages, along with the output, redirect standard error to standard output, and pipe it to your pager:

   tidy index.html 2>&1 | less

To have Tidy write the errors to a file instead, either use the -f filename or -file filename option, or redirect standard error to a file:

   tidy -o output.html -f errs.txt index.html
   tidy index.html > output.html 2> errs.txt 

Both commands run tidy on the file index.html and write the output to the file output.html, while writing any error messages to the file errs.txt.

Writing the error messages to a file is especially useful if the file you are checking has many errors; reading them from a file instead of the console or pager can make it easier to review them.

You can use the -m or -modify option to modify the input file in place; that is, to overwrite its contents with the output from Tidy. Example:

   tidy -f errs.txt -m index.html

That runs tidy on the file index.html, modifying it in place and writing the error messages to the file errs.txt.

Caution: If you use the -m option, you should first save a copy of your file.
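
One way to follow that advice is a small wrapper that copies the file before Tidy overwrites it. The following is a minimal sketch, not part of Tidy: the function name tidy_in_place, the .bak suffix, and the .errs.txt suffix are all assumptions chosen for illustration.

```shell
# Hypothetical helper (not part of Tidy): back up a file, then tidy it in place.
# Usage: tidy_in_place index.html
tidy_in_place() {
  file=$1
  # Keep a copy of the original next to it; ".bak" is an arbitrary suffix.
  cp -p -- "$file" "$file.bak" || return 1
  # -m overwrites the input file with Tidy's output; -f collects the messages.
  tidy -f "$file.errs.txt" -m "$file"
}
```

The function returns Tidy's own exit status, and if the result is not what you wanted, the original is still available as index.html.bak.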

Options and configuration settings

To get a list of available options, use:

   tidy -help

To get a list of all configuration settings, use:

   tidy -help-config

To read the help output a page at a time, pipe it to a pager:

   tidy -help | less
   tidy -help-config | less

Single-letter options other than -f may be combined; for example:

  tidy -f errs.txt -imu foo.html

Using a config file

The most convenient way to configure Tidy is by using a separate config file. Assuming you have created a Tidy config file named config.txt (the name doesn't matter), you can instruct Tidy to use it via the command-line option -config config.txt; for example:

   tidy -config config.txt file1.html file2.html

Alternatively, you can name the default config file via the environment variable named HTML_TIDY, the value of which is the absolute path for the config file.
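
In a POSIX shell, the variable could be set for the current session roughly as sketched below. The helper name use_tidy_config is hypothetical, and the sketch assumes the config file already exists.

```shell
# Hypothetical helper: point HTML_TIDY at a config file using its absolute path.
# Usage: use_tidy_config config.txt
use_tidy_config() {
  # Resolve the containing directory to an absolute path, because HTML_TIDY
  # is expected to hold an absolute path rather than a relative one.
  dir=$(cd "$(dirname "$1")" >/dev/null 2>&1 && pwd) || return 1
  HTML_TIDY=$dir/$(basename "$1")
  export HTML_TIDY
}
```

After calling use_tidy_config config.txt, later tidy commands in the same session should pick up that file as the default config without needing the -config option.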

You can also set config options on the command line by prefixing the option name with "--" (with no intervening space); for example:

  tidy --break-before-br true --show-warnings false

You can find documentation for the full set of configuration options on the Quick Reference page.

Sample config file

The following is an example of a Tidy config file.

// sample config file for HTML tidy
indent: auto
indent-spaces: 2
wrap: 72
markup: yes
output-xml: no
input-xml: no
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
new-inline-tags: cfif, cfelse, math, mroot, 
  mrow, mi, mn, mo, msqrt, mfrac, msubsup, munderover,
  munder, mover, mmultiscripts, msup, msub, mtext,
  mprescripts, mtable, mtr, mtd, mth
new-blocklevel-tags: cfoutput, cfquery
new-empty-tags: cfelse

New configuration options

The experimental HTML5-aware fork of Tidy adds the following new configuration options:

In addition, it adds a new html5 value for the doctype configuration option.

Indenting output for readability

Indenting the source markup of an HTML document makes the markup easier to read. Tidy can indent the markup for an HTML document while recognizing elements whose contents should not be indented. In the example below, Tidy indents the output while preserving the formatting of the <pre> element:

Input:

 <html>
 <head>
 <title>Test document</title>
 </head>
 <body>
 <p>This example shows how Tidy can indent output while preserving
 formatting of particular elements.</p>
 
 <pre>This is
 <em>genuine
       preformatted</em>
    text
 </pre>
 </body>
 </html>
 

Output:

<html>
  <head>
    <title>Test document</title>
  </head>

  <body>
    <p>This example shows how Tidy can indent output while preserving
    formatting of particular elements.</p>
<pre>
This is
<em>genuine
       preformatted</em>
   text
</pre>
  </body>
</html>

Tidy’s indenting behavior is not perfect and can sometimes cause your output to be rendered by browsers in a different way than the input. You can avoid unexpected indenting-related rendering problems by setting indent: no or indent: auto in a config file.

Preserving original indenting not possible

Tidy is not capable of preserving the original indenting of the markup from the input it receives. That’s because Tidy starts by building a clean parse tree from the input, and that parse tree doesn’t contain any information about the original indenting. Tidy then pretty-prints the parse tree using the current config settings. Trying to preserve the original indenting from the input would interact badly with the repair operations needed to build a clean parse tree, and would considerably complicate the code.

Encodings and character references

Tidy assumes by default that you want output encoded in UTF-8, but it also offers a choice of other character encodings: US ASCII, ISO Latin-1, and the ISO 2022 family of 7-bit encodings.

Tidy doesn't yet recognize the use of the HTML <meta> element for specifying the character encoding.

The full set of HTML character references is defined. Cleaned-up output uses named character references for characters when appropriate. Otherwise, characters outside the normal range are output as numeric character references.

Accessibility

Tidy offers advice on potential accessibility problems for people using non-graphical browsers.

Cleaning up presentational markup

Some tools generate HTML with presentational elements such as <font>, <nobr>, and <center>. Tidy's -clean option will replace those elements with CSS style properties.

Some HTML documents rely on the presentational effects of <p> start tags that are not followed by any content. Tidy deletes such <p> tags (as well as any headings that don’t have content). So do not use <p> tags simply for adding vertical whitespace; instead use CSS, or the <br> element. However, note that Tidy won’t discard <p> tags that are followed by any nonbreaking space (that is, the &nbsp; named character reference).

Teaching Tidy about new tags

You can teach Tidy about new tags by declaring them in the configuration file. The syntax is:

  new-inline-tags: tag1, tag2, tag3
  new-empty-tags: tag1, tag2, tag3
  new-blocklevel-tags: tag1, tag2, tag3
  new-pre-tags: tag1, tag2, tag3

The same tag can be declared as both empty and inline, or as both empty and block-level, to define a new empty inline or empty block element. Declaring a tag as both inline and block-level is not advised.

Note that the new tags can only appear where Tidy expects inline or block-level tags respectively. That means you can’t place new tags within the document head or other contexts with restricted content models.
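
As an illustration, the sketch below writes a small config file that declares a few ColdFusion-style tags taken from the sample config file, including cfelse as both empty and inline. The helper name write_newtags_config and the file name used with it are assumptions, not part of Tidy.

```shell
# Hypothetical helper: write a Tidy config file that declares custom tags.
# Usage: write_newtags_config cf.txt
write_newtags_config() {
  # The quoted heredoc delimiter keeps the text literal (no shell expansion).
  cat > "$1" <<'EOF'
// declare ColdFusion-style tags so that Tidy accepts them
new-blocklevel-tags: cfoutput, cfquery
new-empty-tags: cfelse
new-inline-tags: cfif, cfelse
EOF
}
```

The resulting file can then be passed to Tidy with the -config option, for example tidy -config cf.txt file1.html.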

Ignoring PHP, ASP, and JSTE instructions

Tidy will gracefully ignore many cases of PHP, ASP, and JSTE instructions within element content and as replacements for attributes, and preserve them as-is in output; for example:

  <option <% if rsSchool.Fields("ID").Value
    = session("sessSchoolID")
    then Response.Write("selected") %>
    value='<%=rsSchool.Fields("ID").Value%>'>
    <%=rsSchool.Fields("Name").Value%>
    (<%=rsSchool.Fields("ID").Value%>)
  </option>

But note that Tidy may report missing attributes when those are “hidden” within the PHP, ASP, or JSTE code. If you use PHP, ASP, or JSTE code to create a start tag, but place the end tag explicitly in the HTML markup, Tidy won’t be able to match them up, and will delete the end tag. So in that case you are advised to make the start tag explicit and to use PHP, ASP, or JSTE code for just the attributes; for example:

   <a href="<%=random.site()%>">do you feel lucky?</a>

Tidy can also get things wrong if the PHP, ASP, or JSTE code includes quotation marks; for example:

    value="<%=rsSchool.Fields("ID").Value%>"

Tidy will see the quotation mark preceding ID as ending the attribute value, and proceed to complain about what follows.

Tidy allows you to control whether line wrapping on spaces within PHP, ASP, and JSTE instructions is enabled; see the wrap-php, wrap-asp, and wrap-jste config options.

Correcting well-formedness errors in XML markup

Tidy can help you to correct well-formedness errors in XML markup. Tidy doesn't yet recognize all XML features, though; for example, it doesn't understand CDATA sections or DTD subsets.

Using Tidy from scripts

If you run Tidy from Perl or another scripting language, you may find it useful to inspect the exit status that Tidy returns: 0 if everything is fine, 1 if there were warnings, and 2 if there were errors. This is an example using Perl:

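# TIDY is assumed to be a filehandle opened earlier as a pipe to or from tidy.
# close() on such a pipe returns false if tidy exited with a nonzero status.
if (!close(TIDY)) {
  my $exitcode = $? >> 8;  # tidy's exit status is in the high byte of $?
  if ($exitcode == 1) {
    print STDERR "tidy issued warning messages\n";
  } elsif ($exitcode == 2) {
    print STDERR "tidy issued error messages\n";
  } else {
    die "tidy exited with code: $exitcode\n";
  }
} else {
  print STDERR "tidy detected no errors\n";
}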
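
The same check can be made from a shell script by inspecting $? after Tidy exits. The sketch below is illustrative only: the function name tidy_report and the output file suffixes are assumptions, and the messages mirror the Perl example above.

```shell
# Hypothetical helper: run tidy on one file and report on its exit status.
# Usage: tidy_report index.html
tidy_report() {
  # Write the cleaned-up markup and the messages next to the input file.
  tidy -o "$1.tidy.html" -f "$1.errs.txt" "$1"
  status=$?
  case $status in
    0) echo "tidy detected no errors" >&2 ;;
    1) echo "tidy issued warning messages" >&2 ;;
    2) echo "tidy issued error messages" >&2 ;;
    *) echo "tidy exited with code: $status" >&2 ;;
  esac
  # Pass tidy's status on so that callers can branch on it as well.
  return "$status"
}
```

Because the function returns Tidy's status unchanged, a caller can write tidy_report index.html || echo "check index.html.errs.txt" to react to warnings and errors alike.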

Source code

The source code for the experimental HTML5 fork of Tidy can be found at https://github.com/w3c/tidy-html5.

Building the tidy command-line tool

For Linux/BSD/OSX platforms, you can build and install the tidy command-line tool from the source code using the following steps.

  1. make -C build/gmake/
  2. make install -C build/gmake/

Note that you need to run the make install step either as root or with sudo (that is, sudo make install -C build/gmake/).

Building the libtidy shared library

For Linux/BSD/OSX platforms, you can build and install the libtidy shared library (for integrating Tidy into other applications) from the source code using the following steps.

  1. sh build/gnuauto/setup.sh && ./configure && make
  2. make install

Note that you need to run the make install step either as root or with sudo (that is, sudo make install).

Acknowledgements

Dave Raggett has a list of Acknowledgements for people who made suggestions or reported bugs for the original version of Tidy.
