Function to Remove Extended Characters

coldfusion

I wrote a function today that I thought I would share. It replaces extended (non-ASCII) characters with HTML/XML numeric character entities. For example, the character à becomes &#224;.

I wrote this for an RSS feed that had a few unicode characters in it, but the majority of the feed was us-ascii. Rather than changing the encoding, I opted to replace those few chars with an ascii safe XML representation.

Here's the function:

<cffunction name="EscapeExtendedChars" returntype="string">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var charcode = 0>
	<cfset var i = 0>
	<!--- nothing to escape in an empty string --->
	<cfif NOT len>
		<cfreturn arguments.str>
	</cfif>
	<!--- reserve room up front: the input length plus 20 extra characters --->
	<cfset buf.ensureCapacity(JavaCast("int", len+20))>
	<cfloop from="1" to="#len#" index="i">
		<!--- charAt is zero-based, the loop index is one-based --->
		<cfset char = arguments.str.charAt(JavaCast("int", i-1))>
		<cfset charcode = JavaCast("int", char)>
		<!--- printable ASCII, plus tab, line feed and carriage return, passes through unchanged --->
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
				<cfset buf.append(JavaCast("string", char))>
		<cfelse>
			<!--- everything else becomes a numeric character reference --->
			<cfset buf.append(JavaCast("string", "&##"))>
			<cfset buf.append(JavaCast("string", charcode))>
			<cfset buf.append(JavaCast("string", ";"))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>

I'm making use of Java's StringBuffer class, and also the charAt method of java.lang.String. I think this code is a pretty fast solution, since it avoids concatenating strings by hand, and I would guess the charAt method is a bit faster than using Mid.
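As a quick usage sketch, the function can be called like any other UDF. The test string below is built with Chr() so the example does not depend on the template's page encoding; the variable names are only for illustration.

```cfml
<!--- Build "Voilà" using Chr(224) for the "à" so page encoding does not matter --->
<cfset original = "Voil" & Chr(224)>
<cfset escaped = EscapeExtendedChars(original)>

<!--- HTMLEditFormat makes the browser show the entity text itself rather than render it --->
<cfoutput>#HTMLEditFormat(escaped)#</cfoutput>
```

Without HTMLEditFormat the browser would simply render the entity as the original character, which is the desired result in a feed or page.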


Comments

On 01/21/2005 at 6:49:15 PM UTC Ryan Guill wrote:
1
You should submit this to cflib, it looks like it will be quite useful.

On 01/21/2005 at 7:42:20 PM UTC Nolan wrote:
2
this is great! i have lots of Marketing folks that like to copy/paste from MS Word into my CMS engine. Now I have an extra level of formatting things correctly. thanks!

On 01/21/2005 at 11:36:46 PM UTC paulh wrote:
3
can't say that i agree with ripping out perfectly valid unicode chars and replacing them w/HTML entities or NCR-- that's going backwards. this will have repercussions for searching, etc.

btw nolan, those chars are most likely not unicode but windows codepage, which is a sort of superset of iso-8859-1.

On 01/22/2005 at 10:02:07 AM UTC sporter wrote:
4
For a different approach, check out DeMoronize()

http://www.cflib.org/udf.cfm?ID=725

On 01/26/2005 at 12:39:32 AM UTC David Sparkman wrote:
5
Is there any particular reason that you chose len+20 for the call to buf.ensureCapacity?

On 01/27/2005 at 5:14:30 AM UTC Bjorn Jensen wrote:
6
I tried the following code, and it seems that 95% of the times I run this (mostly on initial compile it's different), then a normal string concatenation is faster, can anyone else verify this or let me know if I'm doing something wrong in my test ?

<cfsilent>
<cfset stra = "">
<cfset strb = CreateObject("java", "java.lang.StringBuffer")>

<cfset a = gettickcount()>
<cfloop from="1" to="10000" index="I">
	<cfset stra = stra & "a">
</cfloop>
<cfset b = gettickcount()>

<cfset c = gettickcount()>
<cfloop from="1" to="10000" index="I">
	<cfset strb.append(JavaCast("string", "a"))>
</cfloop>
<cfset d = gettickcount()>
</cfsilent>
<cfoutput>result: #b-a# - #d-c#<br></cfoutput>

On 06/07/2005 at 6:15:50 PM UTC Stephen Cassady wrote:
7
I've been trying to run the script to pull out those nasty characters that stop Verity (MX7/k2) dead in its tracks. While looking over some 1600 text documents, I keep getting a "500 null" error (that's all that shows up on the screen). When running a reduced set (like 30) it works perfectly fine.

Any suggestions on what to do in moving forward and resolve this issue? Thanks.

On 06/07/2005 at 6:23:09 PM UTC Ryan Guill wrote:
8
Stephen, if it works fine with a small set and you are getting a 500 error with a large one, it probably means the process is taking too long and is timing out. Is there a chance your function is being called recursively? You may be hitting a never-ending loop condition, or you may just need to think some more about performance. You can also increase the timeout in the administrator, but that should be a last-ditch effort to be avoided if possible.
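If a timeout does turn out to be the cause, one option short of changing the server-wide setting is to raise the limit for just the one long-running template with cfsetting. This is only a sketch; the 300-second value is an arbitrary illustration, not a recommendation.

```cfml
<!--- Allow this request (and only this request) up to 300 seconds before it times out --->
<cfsetting requesttimeout="300">
```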

On 01/09/2006 at 10:55:07 PM UTC pete wrote:
9
Thanks for posting the code. Just saved my life in the 11th hour!

Cheers, Pete (aka lad4bear)

On 02/20/2007 at 12:16:47 AM UTC felipe wrote:
10
Can you show me how to use this function? I'm new to ColdFusion, and I have a RealBasic app that updates data in a MySQL db. I'm having problems making RB save (á é í); instead I get: (á é í) characters. My thought is to use your function to display the right ones (á é í) on the page until I find a way to make RB save with the right encoding.

Hope someone can show me how to use this function in my cfquery to replace those characters.

Thanks

(excuse my poor English)
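A minimal sketch of calling the function on query results is shown here. The datasource, table, and column names are hypothetical placeholders, and the function is assumed to be defined or included earlier in the same template.

```cfml
<!--- Hypothetical datasource, table, and column names --->
<cfquery name="getPeople" datasource="myDSN">
	SELECT name
	FROM people
</cfquery>

<!--- Escape each value as it is written to the page --->
<cfoutput query="getPeople">
	#EscapeExtendedChars(getPeople.name)#<br>
</cfoutput>
```

Note that the function escapes whatever characters the string actually contains. If the data was already stored in the wrong encoding, the garbled characters are what get escaped, so the encoding problem still has to be fixed at the source.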

On 01/22/2010 at 6:00:50 AM UTC Laurent wrote:
11
Nice and clean. Perfect when your database is full of word pastes and you need to cfcontent that back to word! (for a weird reason, utf-8 them seems not enough.) It just rocks! Thanks!

On 03/23/2011 at 2:40:40 PM UTC Vladimir Ugryumov wrote:
12
This proved to be instantly useful for me today, 6 years later :) Just had to modify the substitution for the caught characters to fit my particular case - this is applied against file names, so had to avoid patterns like "&#123;"

Thanks, Pete!
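The comment above does not say which substitution was used. As one possible sketch for file names, every character outside printable ASCII could be replaced with an underscore using REReplace; the file name and the choice of underscore are assumptions for illustration.

```cfml
<!--- Hypothetical file name containing two "é" characters (code point 233) --->
<cfset fileName = "r" & Chr(233) & "sum" & Chr(233) & ".doc">

<!--- [^ -~] matches anything outside the range space through tilde, i.e. non-printable or non-ASCII --->
<cfset safeName = REReplace(fileName, "[^ -~]", "_", "all")>

<!--- outputs r_sum_.doc --->
<cfoutput>#safeName#</cfoutput>
```

This does not deal with characters such as slashes or colons, which are printable ASCII but still not allowed in file names on some systems.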

On 04/06/2011 at 10:03:08 PM UTC Paul wrote:
13
The function didn't seem to take care of pesky U+FFFF characters in my XML.
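This is expected from the way the function works: it turns U+FFFF into the reference &#65535;, and XML 1.0 does not allow character references to code points outside its legal character range. The same applies to most control characters below 32. One possible workaround, sketched here under the name StripInvalidXmlChars (a hypothetical helper, not part of the original post), is to drop such characters before escaping.

```cfml
<cffunction name="StripInvalidXmlChars" returntype="string" output="false">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var code = 0>
	<cfset var i = 0>
	<cfloop from="1" to="#len#" index="i">
		<cfset char = Mid(arguments.str, i, 1)>
		<cfset code = Asc(char)>
		<!--- keep tab, line feed, carriage return, and the two legal ranges
		      32-55295 (U+0020-U+D7FF) and 57344-65533 (U+E000-U+FFFD) --->
		<cfif code EQ 9 OR code EQ 10 OR code EQ 13
			OR (code GTE 32 AND code LTE 55295)
			OR (code GTE 57344 AND code LTE 65533)>
			<cfset buf.append(JavaCast("string", char))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>

<!--- Strip first, then escape what is left --->
<cfset cleaned = EscapeExtendedChars(StripInvalidXmlChars(someText))>
```

Here someText stands for whatever string is being prepared. As a simplification, this sketch also drops surrogate code units, so characters above U+FFFF are removed rather than preserved.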
