Function to Remove Extended Characters

January 21, 2005 by Pete Freitag

I wrote a ColdFusion function today that I thought I would share. It replaces extended (non-ASCII) Unicode characters with numeric HTML/XML character references. For example, the character à becomes &#224;.

I wrote this for an RSS feed that had a few Unicode characters in it, while the majority of the feed was US-ASCII. Rather than changing the encoding, I opted to replace those few characters with an ASCII-safe XML representation.

Here's the function:

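<cffunction name="EscapeExtendedChars" returntype="string">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var charcode = 0>
	<!--- var-scope the loop index so it stays local to this call --->
	<cfset var i = 0>
	<!--- nothing to escape in an empty string --->
	<cfif NOT len>
		<cfreturn arguments.str>
	</cfif>
	<!--- reserve room for the input plus some slack for escaped characters --->
	<cfset buf.ensureCapacity(JavaCast("int", len+20))>
	<cfloop from="1" to="#len#" index="i">
		<!--- charAt is zero-based; the CFML loop index is one-based --->
		<cfset char = arguments.str.charAt(JavaCast("int", i-1))>
		<cfset charcode = JavaCast("int", char)>
		<!--- printable ASCII, tab, line feed and carriage return pass through unchanged --->
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
				<cfset buf.append(JavaCast("string", char))>
		<cfelse>
			<!--- everything else becomes a decimal character reference (## is a literal #) --->
			<cfset buf.append(JavaCast("string", "&##"))>
			<cfset buf.append(JavaCast("string", charcode))>
			<cfset buf.append(JavaCast("string", ";"))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>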

I'm making use of Java's StringBuffer class and the charAt method of java.lang.String. I think this code is a pretty fast solution, since it avoids building the result through repeated string concatenation, and I would guess the charAt method may be a bit faster than the built-in CFML Mid function.
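
To make the behavior concrete, here is a minimal usage sketch. It assumes the function above is already defined or included on the page. The variable names title and safeTitle and the sample text are hypothetical, and Chr() is used to build the accented characters so the example does not depend on the template's page encoding.

```cfml
<!--- Hypothetical sample text: "Voilà, café" built with Chr() to avoid page-encoding issues --->
<cfset title = "Voil" & Chr(224) & ", caf" & Chr(233)>

<!--- Characters outside printable ASCII become decimal character references --->
<cfset safeTitle = EscapeExtendedChars(title)>

<!--- Emit the ASCII-only result inside an XML element, as in an RSS feed --->
<cfoutput><title>#safeTitle#</title></cfoutput>
```

With that input the escaped string should come out as Voil&#224;, caf&#233; because à is code point 224 and é is 233. Note that characters such as & and < fall inside the printable ASCII range, so the function passes them through untouched; they still need to be XML-escaped separately.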

Comments

Ryan Guill January 21, 2005

You should submit this to cflib, it looks like it will be quite useful.

Nolan January 21, 2005

This is great! I have lots of marketing folks who like to copy/paste from MS Word into my CMS engine. Now I have an extra level of formatting things correctly. Thanks!

paulh January 21, 2005

Can't say that I agree with ripping out perfectly valid Unicode chars and replacing them w/ HTML entities or NCRs; that's going backwards. This will have repercussions for searching, etc. BTW Nolan, those chars are most likely not Unicode but a Windows code page, which is a sort of superset of ISO-8859-1.

David Sparkman January 25, 2005

Is there any particular reason that you chose len+20 for the call to buf.ensureCapacity?

Stephen Cassady June 7, 2005

I've been trying to run the script to pull out those nasty characters that stop Verity (MX7/K2) dead in its tracks. While looking over some 1600 text documents, I keep getting a "500 null" error (that's all that shows up on the screen). When running a reduced set (like 30) it works perfectly fine. Any suggestions on how to move forward and resolve this issue? Thanks.

Ryan Guill June 7, 2005

Stephen, if it works fine with a small set and you are getting a 500 error with a large one, it means the process is taking too long. It's timing out. Chances are your function is being called recursively? You may be getting some never-ending loop conditions, or you may just need to think some more about performance. You can also increase the timeout in the administrator, but that should be a last-ditch effort that needs to be avoided if possible.

pete January 9, 2006

Thanks for posting the code. Just saved my life in the 11th hour! Cheers, Pete (aka lad4bear)

felipe February 19, 2007

Can you show me how to use this function? I'm new to ColdFusion, and I have a RealBasic app that updates data in a MySQL db. I'm having problems making RB save (á é í); instead I get (Ã¡ Ã© Ã) characters. My thought is to use your function to display the right ones (á é í) on the page until I find a way to make RB save the right encoding. I hope someone can show me how to use this function in my cfquery to replace those characters. Thanks (excuse my poor English)

Laurent January 22, 2010

Nice and clean. Perfect when your database is full of Word pastes and you need to cfcontent that back to Word! (For a weird reason, UTF-8 encoding them seems not enough.) It just rocks! Thanks!

Vladimir Ugryumov March 23, 2011

This proved to be instantly useful for me today, 6 years later :) Just had to modify the substitution for the caught characters to fit my particular case: this is applied against file names, so I had to avoid patterns like "{". Thanks, Pete!

Paul April 6, 2011

The function didn't seem to take care of pesky U+FFFF characters in my XML.