Handling Numeric XML Entities in a Weblog Move

I'm exporting a Radio UserLand weblog to Movable Type for a client, turning Radio's XML archive of weblog entries into a Movable Type import file. I wrote a Java application that employs the XOM XML library to read Radio's weblog data.

Some numeric character entities in Radio's XML data threw me for a loop: &#226; (’), &#192; (¿), &#142; (é), &#135; (á) and &#151; (ó). They were transformed -- either by XOM or the Xerces XML parser that it uses -- into garbage characters that display incorrectly in Movable Type.

After fumbling around, I found a solution: Read a weblog entry's XML data as a text file, replace the numeric XML entities with the equivalent numeric HTML entities and parse the resulting file with XOM:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// replace bad character entities with good ones
// (directory is a field of the enclosing class that holds the archive path)
public void prepareFile(String source) throws IOException {
  File sourceFile = new File(directory + source);
  File destination = new File("input.xml");
  // try-with-resources closes both files even if an exception is thrown
  try (BufferedReader reader = new BufferedReader(new FileReader(sourceFile));
       BufferedWriter writer = new BufferedWriter(new FileWriter(destination))) {
    String text;
    while ((text = reader.readLine()) != null) {
      // replace() matches literal text; no regular expression is needed here
      text = text.replace("&#226;", "&#8217;"); // curly single quote mark
      text = text.replace("&#192;", "&#191;"); // upside down question mark
      text = text.replace("&#142;", "&#233;"); // lowercase accented e
      text = text.replace("&#135;", "&#225;"); // lowercase accented a
      text = text.replace("&#151;", "&#243;"); // lowercase accented o
      // blank lines are dropped, as before
      if (!text.equals("")) {
        writer.write(text);
        writer.newLine();
      }
    }
  }
}

This is a clumsy solution that relies on escaped markup to produce the HTML entities, but I can't find a better one without editing the client's Radio data by hand. I'm trying to avoid that, because I want to use this application to move other weblogs.

Radio UserLand saves an XML backup of all weblog posts and categories in the software's backups\weblog\Archive subdirectory. If you're using Radio, enable the Archiving in XML preferences to take advantage of this feature, which makes it easier to export the data to another weblog publishing program.

Comments

WordPress includes a function called convert_chars (in the file functions-formatting.php under wp-includes) that converts invalid Unicode references (usually from Windows CP 1252) to their valid equivalents.

At the very least, the function will show you the common characters to translate between.

> They were transformed -- either by XOM or the Xerces XML parser that it uses -- into garbage characters that display incorrectly in Movable Type.

That's because they *are* garbage characters.

Numeric character references refer to Unicode character positions. But those numeric character references are referring to Windows-1252 character set codepoints, not Unicode codepoints. This is wrong even when the character encoding is Windows-1252. No matter what the character encoding of a document, numeric character references *always* refer to Unicode codepoints.
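To see the point above in action, here is a minimal sketch that uses the XML parser bundled with the JDK (not XOM, which the post uses) to parse one reference in the C1 range and one correct reference. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NumericReferenceDemo {
    // Parses a tiny XML document and returns the text inside its root element.
    public static String parseText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // 142 is the MacRoman byte for e-acute, but as a Unicode code point
        // it is U+008E, an invisible C1 control character.
        String broken = parseText("<p>caf&#142;</p>");
        // 233 is the Unicode code point for e-acute.
        String fixed = parseText("<p>caf&#233;</p>");
        System.out.printf("&#142; -> U+%04X%n", (int) broken.charAt(3));
        System.out.printf("&#233; -> U+%04X%n", (int) fixed.charAt(3));
    }
}
```

The parser is doing exactly what the XML specification asks: it treats the number as a Unicode code point, whatever the document's encoding is.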

Radio's just plain broken if it includes these numeric character references.

It looks like those characters are from MacRoman.

You might want to fill out this filter with more characters, and have this filter only apply to people who are migrating from Radio UserLand on the Mac.

For Windows users, a different set of mappings would be necessary.
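One way to build such a filter is a single regular-expression pass that looks each reference up in a per-platform table. This is only a sketch: the class name, the method name and the table are hypothetical, and the table holds just a handful of MacRoman values, so it would need filling out before real use. A Windows migration would pass in a different table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferenceRemapper {
    // Matches decimal numeric character references such as &#142;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Partial table: MacRoman byte value -> Unicode code point (illustrative subset)
    public static final Map<Integer, Integer> MAC_ROMAN = new HashMap<>();
    static {
        MAC_ROMAN.put(131, 201); // uppercase accented e
        MAC_ROMAN.put(135, 225); // lowercase accented a
        MAC_ROMAN.put(142, 233); // lowercase accented e
        MAC_ROMAN.put(150, 241); // lowercase n with tilde
        MAC_ROMAN.put(151, 243); // lowercase accented o
        MAC_ROMAN.put(161, 176); // degree sign
        MAC_ROMAN.put(192, 191); // upside down question mark
        MAC_ROMAN.put(193, 161); // upside down exclamation point
    }

    // Rewrites every reference found in the table; all others are left untouched.
    public static String remap(String text, Map<Integer, Integer> table) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer target = table.get(Integer.parseInt(m.group(1)));
            String replacement = (target == null) ? m.group() : "&#" + target + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(remap("caf&#142; &#193;hola! 20&#161;", MAC_ROMAN));
    }
}
```

Because every reference is rewritten exactly once, the output of one mapping can never be picked up by another. With chained string replacements, 193 mapped to 161 followed by 161 mapped to 176 would turn an upside down exclamation point into a degree sign.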

Some of the characters coming out of Radio do appear to be MacRoman (and the client was running the software on a Mac). But others don't match up, so I dealt with the characters individually. Here's my kludge for all of the bad characters I found:

text = text.replaceAll("&#226;", "&#8217;"); // curly close single quote
text = text.replaceAll("&#227;", "&#8220;"); // curly open quote
text = text.replaceAll("&#228;", "&#8221;"); // curly close quote
text = text.replaceAll("&#240;", "&#8212;"); // em-dash
text = text.replaceAll("&#246;", "&#8212;"); // em-dash
text = text.replaceAll("&#247;", "&#8212;"); // em-dash
text = text.replaceAll("&#183;", "&#8212;"); // em-dash
// the degree sign must be handled before the next line produces new &#161; references
text = text.replaceAll("&#161;", "&#176;"); // degree sign
text = text.replaceAll("&#193;", "&#161;"); // upside down exclamation point
text = text.replaceAll("&#192;", "&#191;"); // upside down question mark
text = text.replaceAll("&#135;", "&#225;"); // lowercase accented a
text = text.replaceAll("&#231;", "&#193;"); // uppercase accented a
text = text.replaceAll("&#203;", "&#192;"); // uppercase reverse accented a
text = text.replaceAll("&#142;", "&#233;"); // lowercase accented e
text = text.replaceAll("&#131;", "&#201;"); // uppercase accented e
text = text.replaceAll("&#150;", "&#241;"); // lowercase n with tilde
text = text.replaceAll("&#151;", "&#243;"); // lowercase accented o
text = text.replaceAll("&#238;", "&#211;"); // uppercase accented o
text = text.replaceAll("&#163;", "&#163;"); // pound sign (same code point, so this is a no-op)
text = text.replaceAll("&#168;", "&#169;"); // copyright symbol
text = text.replaceAll("&#202;", " "); // junk
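A list like this has to be discovered one character at a time, so a quick census of the references in the archive can help. The sketch below is a hypothetical helper, not part of the original application. It counts every decimal reference in a string and flags those in the 128 to 159 range, which are C1 control characters in Unicode and so almost never intended.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferenceCensus {
    // Matches decimal numeric character references such as &#142;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Returns code point -> number of occurrences, sorted by code point.
    public static Map<Integer, Integer> count(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        Matcher m = NUMERIC_REF.matcher(text);
        while (m.find()) {
            counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
        }
        return counts;
    }

    // Code points 128-159 are C1 controls, a strong hint of a mis-encoded reference.
    public static boolean isC1Control(int codePoint) {
        return codePoint >= 128 && codePoint <= 159;
    }

    public static void main(String[] args) {
        String sample = "caf&#142; ol&#142; se&#233;or";
        for (Map.Entry<Integer, Integer> e : count(sample).entrySet()) {
            String flag = isC1Control(e.getKey()) ? " (suspect)" : "";
            System.out.println("&#" + e.getKey() + "; x" + e.getValue() + flag);
        }
    }
}
```

References above 159 still need a human eye, since a value such as 192 is a valid Unicode character that merely happens to be the wrong one here.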
