Function to Remove Extended Characters

Published on January 21, 2005
By Pete Freitag

I wrote a ColdFusion function today that I thought I would share. It replaces extended (non-ASCII) Unicode characters with numeric HTML/XML character references. For example, the character à becomes &#224;.

I wrote this for an RSS feed that had a few Unicode characters in it, while the majority of the feed was US-ASCII. Rather than changing the encoding, I opted to replace those few characters with an ASCII-safe XML representation.

Here's the function:

<cffunction name="EscapeExtendedChars" returntype="string" output="false">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var charcode = 0>
	<!--- var-scope the loop index so it stays local to the function --->
	<cfset var i = 0>
	<!--- nothing to escape in an empty string --->
	<cfif NOT len>
		<cfreturn arguments.str>
	</cfif>
	<cfset buf.ensureCapacity(JavaCast("int", len+20))>
	<cfloop from="1" to="#len#" index="i">
		<!--- charAt is zero-based, the CFML loop index is one-based --->
		<cfset char = arguments.str.charAt(JavaCast("int", i-1))>
		<cfset charcode = JavaCast("int", char)>
		<!--- keep printable ASCII plus tab, line feed and carriage return --->
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
			<cfset buf.append(JavaCast("string", char))>
		<cfelse>
			<!--- everything else becomes a numeric character reference, e.g. &##224; --->
			<cfset buf.append(JavaCast("string", "&##"))>
			<cfset buf.append(JavaCast("string", charcode))>
			<cfset buf.append(JavaCast("string", ";"))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>
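
Calling it looks something like the snippet below. This is only a minimal sketch: the variable name itemTitle and the sample text are made up for illustration, and in a real feed the string would more likely come from a query column.

```cfml
<!--- itemTitle is a hypothetical variable; Chr(224) is "à" and Chr(233) is "é" --->
<cfset itemTitle = "Voil" & Chr(224) & " le r" & Chr(233) & "sum" & Chr(233)>

<!--- Escape the non-ASCII characters before writing the value into the feed --->
<cfoutput><title>#EscapeExtendedChars(itemTitle)#</title></cfoutput>
```

With that input the title element should contain Voil&#224; le r&#233;sum&#233;. Note that the function passes &, < and > through unchanged, because they fall inside the printable ASCII range, so any XML-special characters in the text still need to be escaped separately.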

I'm making use of Java's StringBuffer class, and also the charAt method of java.lang.String. I think this code is a pretty fast solution, since it avoids building the result with repeated string concatenation, and I would guess the charAt method may be a bit faster than the built-in CFML Mid function. I haven't benchmarked either claim.
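
For comparison, here is roughly what the same logic might look like using only built-in CFML functions (Mid, Asc and the & concatenation operator). The name EscapeExtendedCharsCFML is hypothetical, and this is a sketch of the approach the Java version is meant to avoid, not a measured baseline.

```cfml
<cffunction name="EscapeExtendedCharsCFML" returntype="string" output="false">
	<cfargument name="str" type="string" required="true">
	<cfset var result = "">
	<cfset var char = "">
	<cfset var charcode = 0>
	<cfset var i = 0>
	<cfloop from="1" to="#Len(arguments.str)#" index="i">
		<!--- Mid is one-based, so no index adjustment is needed here --->
		<cfset char = Mid(arguments.str, i, 1)>
		<cfset charcode = Asc(char)>
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
			<!--- each concatenation creates a new string --->
			<cfset result = result & char>
		<cfelse>
			<cfset result = result & "&##" & charcode & ";">
		</cfif>
	</cfloop>
	<cfreturn result>
</cffunction>
```

Both versions should return the same output for the same input. The difference is that this one creates a new string on every pass through the loop, which is the cost the StringBuffer is there to avoid on long inputs.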

Function to Remove Extended Characters was first published on January 21, 2005.


Comments

You should submit this to cflib, it looks like it will be quite useful.

by Ryan Guill on 01/21/2005 at 6:49:15 PM UTC

this is great! i have lots of Marketing folks that like to copy/paste from MS Word into my CMS engine. Now I have an extra level of formatting things correctly. thanks!

by Nolan on 01/21/2005 at 7:42:20 PM UTC

can't say that i agree with ripping out perfectly valid unicode chars and replacing them w/HTML entities or NCRs-- that's going backwards. this will have repercussions for searching, etc.

btw nolan, those chars are most likely not unicode but windows codepage, which is a sort of superset of iso-8859-1.

by paulh on 01/21/2005 at 11:36:46 PM UTC

Is there any particular reason that you chose len+20 for the call to buf.ensureCapacity?

by David Sparkman on 01/26/2005 at 12:39:32 AM UTC

I've been trying to run the script to pull out those nasty characters that stop Verity (MX7/k2) dead in its tracks. While looking over some 1600 text documents, I keep getting a "500 null" error (that's all that shows up on the screen). When running a reduced set (like 30) it works perfectly fine.

Any suggestions on how to move forward and resolve this issue? Thanks.

by Stephen Cassady on 06/07/2005 at 6:15:50 PM UTC

Stephen, if it works fine with a small set and you are getting a 500 error with a large one, it probably means the process is taking too long and timing out. Is your function perhaps being called recursively? You may be hitting a never-ending loop condition, or you may just need to think some more about performance. You can also increase the request timeout in the administrator, but that should be a last-ditch effort, to be avoided if possible.

by Ryan Guill on 06/07/2005 at 6:23:09 PM UTC

Thanks for posting the code. Just saved my life in the 11th hour!

Cheers, Pete (aka lad4bear)

by pete on 01/09/2006 at 10:55:07 PM UTC

Can you show me how to use this function? I'm new to ColdFusion, and I have a RealBasic app that updates data in a MySQL db. I'm having problems making RB save (á é í); instead I get (Ã¡ Ã© Ã) characters. My thought is to use your function to display the right ones (á é í) on the page until I find a way to make RB save with the right encoding.

I hope someone can show me how to use this function in my cfquery to replace those characters.

Thanks

(excuse my poor English)

by felipe on 02/20/2007 at 12:16:47 AM UTC

Nice and clean. Perfect when your database is full of Word pastes and you need to cfcontent that back to Word! (For some weird reason, UTF-8 encoding them doesn't seem to be enough.) It just rocks! Thanks!

by Laurent on 01/22/2010 at 6:00:50 AM UTC

This proved to be instantly useful for me today, 6 years later :) Just had to modify the substitution for the caught characters to fit my particular case - this is applied against file names, so had to avoid patterns like "{"

Thanks, Pete!

by Vladimir Ugryumov on 03/23/2011 at 2:40:40 PM UTC

The function didn't seem to take care of pesky U+FFFF characters in my XML.

by Paul on 04/06/2011 at 10:03:08 PM UTC
