Function to Remove Extended Characters

January 21, 2005

I wrote a function today that I thought I would share. It replaces extended (Unicode) characters with HTML/XML numeric character entities. For example, the character à becomes &#224;.

I wrote this for an RSS feed that had a few Unicode characters in it, but the majority of the feed was US-ASCII. Rather than changing the encoding, I opted to replace those few chars with an ASCII-safe XML representation.

Here's the function:

<cffunction name="EscapeExtendedChars" returntype="string">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var charcode = 0>
	<!--- keep the loop index local to the function --->
	<cfset var i = 0>
	<cfset buf.ensureCapacity(JavaCast("int", len+20))>
	<!--- nothing to escape in an empty string --->
	<cfif NOT len>
		<cfreturn arguments.str>
	</cfif>
	<cfloop from="1" to="#len#" index="i">
		<!--- charAt is zero-based, the CF loop is one-based --->
		<cfset char = arguments.str.charAt(JavaCast("int", i-1))>
		<cfset charcode = JavaCast("int", char)>
		<!--- printable ASCII, LF, CR and tab pass through unchanged --->
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
			<cfset buf.append(JavaCast("string", char))>
		<cfelse>
			<!--- everything else becomes a numeric entity such as &#224; --->
			<cfset buf.append(JavaCast("string", "&##"))>
			<cfset buf.append(JavaCast("string", charcode))>
			<cfset buf.append(JavaCast("string", ";"))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>

I'm making use of Java's StringBuffer class, and also the charAt method of java.lang.String. I think this code is a pretty fast solution, since it avoids appending strings by hand, and I would guess the charAt method may be a bit faster than using Mid.
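
As a rough usage sketch, here is how the function might be called on one piece of feed text. The variable name feedTitle and its sample value are made up for illustration; they are not from the original feed.

```cfml
<!--- Hypothetical feed title containing three non-ASCII characters
      (Chr(233) is é, Chr(224) is à) --->
<cfset feedTitle = "Caf#Chr(233)# d#Chr(233)#j#Chr(224)# vu">

<!--- Emits: Caf&#233; d&#233;j&#224; vu --->
<cfoutput>#EscapeExtendedChars(feedTitle)#</cfoutput>
```

Note that markup characters such as & and < fall inside the printable ASCII range, so the function passes them through unchanged. They still need separate escaping, for example with XMLFormat applied before this function, so that the generated entities are not escaped a second time.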



You should submit this to cflib, it looks like it will be quite useful.
this is great! i have lots of Marketing folks that like to copy/paste from MS Word into my CMS engine. Now I have an extra level of formatting things correctly. thanks!
can't say that i agree with ripping out perfectly valid unicode chars and replacing them w/HTML entities or NCR-- that's going backwards. this will have repercussions for searching, etc. btw nolan, those chars are most likely not unicode but windows codepage, which is a sort of superset of iso-8859-1.
For a different approach, check out DeMoronize()
Is there any particular reason that you chose len+20 for the call to buf.ensureCapacity?
I tried the following code, and about 95% of the times I run it (the initial compile is mostly the exception), normal string concatenation is faster. Can anyone else verify this, or let me know if I'm doing something wrong in my test?

<cfsilent>
<cfset stra = "">
<cfset strb = CreateObject("java", "java.lang.StringBuffer")>
<cfset a = gettickcount()>
<cfloop from="1" to="10000" index="I">
	<cfset stra = stra & "a">
</cfloop>
<cfset b = gettickcount()>
<cfset c = gettickcount()>
<cfloop from="1" to="10000" index="I">
	<cfset strb.append(JavaCast("string", "a"))>
</cfloop>
<cfset d = gettickcount()>
</cfsilent>
<cfoutput>result: #b-a# - #d-c#<br></cfoutput>
I've been trying to run the script to pull out those nasty characters that stop Verity (MX7/k2) dead in its tracks. While looking over some 1600 text documents, I keep getting a "500 null" error (that's all that shows up on the screen). When running a reduced set (like 30) it works perfectly fine. Any suggestions on how to move forward and resolve this issue? Thanks.
Stephen, if it works fine with a small set and you are getting a 500 error with a large one, it means the process is taking too long. It's timing out. Chances are, your function is being called recursively? You may be getting some never-ending loop conditions, or you may just need to think some more about performance. You can also increase the timeout in the administrator, but that should be a last-ditch effort that needs to be avoided if possible.
Thanks for posting the code. Just saved my life in the 11th hour! Cheers, Pete (aka lad4bear)
Can you show me how to use this function? I'm new to ColdFusion, and I have a RealBasic app that updates data in a MySQL DB. I'm having problems making RB save (á é í); instead I get (á é í) characters. My thought is to use your function to display the right ones (á é í) on the page until I find a way to make RB save the right encoding. I hope someone can show me how to use this function in my cfquery to replace those characters. Thanks (excuse my poor English)
Nice and clean. Perfect when your database is full of Word pastes and you need to cfcontent that back to Word! (For a weird reason, UTF-8 encoding them seems not to be enough.) It just rocks! Thanks!
This proved to be instantly useful for me today, 6 years later :) Just had to modify the substitution for the caught characters to fit my particular case - this is applied against file names, so had to avoid patterns like "&#123;"

Thanks, Pete!
The function didn't seem to take care of pesky U+FFFF characters in my XML.
