Function to Remove Extended Characters

coldfusion

I wrote a function today that I thought I would share. It replaces extended (non-ASCII, or Unicode) characters with numeric HTML/XML character references. For example, the character à becomes &#224;.

I wrote this for an RSS feed that had a few Unicode characters in it, while the majority of the feed was US-ASCII. Rather than changing the encoding, I opted to replace those few characters with an ASCII-safe XML representation.

Here's the function:

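<!--- Replaces every character outside printable ASCII (plus tab, LF, CR)
      with a numeric character reference such as &#224; --->
<cffunction name="EscapeExtendedChars" returntype="string">
	<cfargument name="str" type="string" required="true">
	<cfset var buf = CreateObject("java", "java.lang.StringBuffer")>
	<cfset var len = Len(arguments.str)>
	<cfset var char = "">
	<cfset var charcode = 0>
	<!--- Keep the loop index local to the function --->
	<cfset var i = 0>
	<!--- Nothing to escape in an empty string --->
	<cfif NOT len>
		<cfreturn arguments.str>
	</cfif>
	<!--- Pre-size the buffer: input length plus some headroom --->
	<cfset buf.ensureCapacity(JavaCast("int", len+20))>
	<cfloop from="1" to="#len#" index="i">
		<!--- charAt is zero-based, the loop index is one-based --->
		<cfset char = arguments.str.charAt(JavaCast("int", i-1))>
		<cfset charcode = JavaCast("int", char)>
		<!--- Printable ASCII, tab, line feed and carriage return pass through --->
		<cfif (charcode GT 31 AND charcode LT 127) OR charcode EQ 10
			OR charcode EQ 13 OR charcode EQ 9>
				<cfset buf.append(JavaCast("string", char))>
		<cfelse>
			<!--- Everything else becomes a numeric reference; the doubled pound sign is an escaped single one --->
			<cfset buf.append(JavaCast("string", "&##"))>
			<cfset buf.append(JavaCast("string", charcode))>
			<cfset buf.append(JavaCast("string", ";"))>
		</cfif>
	</cfloop>
	<cfreturn buf.toString()>
</cffunction>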

I'm making use of Java's StringBuffer class, and also the charAt method of java.lang.String. I think this code is a pretty fast solution, since it avoids building the result through repeated string concatenation, and I would guess the charAt method may be a bit faster than using Mid.
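
For anyone wondering how to call it, here is a minimal usage sketch. It assumes the function has been saved in a file named udf.cfm next to the calling page; that file name and the title variable are made up for illustration and are not part of the original code.

```cfml
<!--- Assumption: udf.cfm is a hypothetical file containing the EscapeExtendedChars function --->
<cfinclude template="udf.cfm">

<!--- Build a test string with Chr() so this file's own encoding does not matter:
      Chr(233) is é and Chr(224) is à --->
<cfset title = "D" & Chr(233) & "j" & Chr(224) & " vu">

<!--- Escape the value on its way into the feed markup --->
<cfoutput>
<item>
	<title>#EscapeExtendedChars(title)#</title>
</item>
</cfoutput>
```

If the function behaves as described, the é and à should come out as &#233; and &#224;, with the rest of the title untouched. Note that ASCII markup characters such as & and < fall inside the 32 to 126 range, so the function passes them through unchanged; they still need their own escaping before going into XML.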


On 01/21/2005 at 6:49:15 PM MST Ryan Guill wrote:
1
You should submit this to cflib, it looks like it will be quite useful.

On 01/21/2005 at 7:42:20 PM MST Nolan wrote:
2
this is great! i have lots of Marketing folks that like to copy/paste from MS Word into my CMS engine. Now I have an extra level of formatting things correctly. thanks!

On 01/21/2005 at 11:36:46 PM MST paulh wrote:
3
can't say that i agree with ripping out perfectly valid unicode chars and replacing them w/HTML entities or NCR-- that's going backwards. this will have repercussions for searching, etc.

btw nolan, those chars are most likely not unicode but windows codepage, which is a sort of superset of iso-8859-1.

On 01/22/2005 at 10:02:07 AM MST sporter wrote:
4
For a different approach, check out DeMoronize()

http://www.cflib.org/udf.cfm?ID=725

On 01/26/2005 at 12:39:32 AM MST David Sparkman wrote:
5
Is there any particular reason that you chose len+20 for the call to buf.ensureCapacity?

On 01/27/2005 at 5:14:30 AM MST Bjorn Jensen wrote:
6
I tried the following code, and about 95% of the time I run it (the initial compile is mostly the exception), normal string concatenation is faster. Can anyone else verify this, or let me know if I'm doing something wrong in my test?

<cfsilent>
	<cfset stra = "">
	<cfset strb = CreateObject("java", "java.lang.StringBuffer")>

	<cfset a = gettickcount()>
	<cfloop from="1" to="10000" index="I">
		<cfset stra = stra & "a">
	</cfloop>
	<cfset b = gettickcount()>

	<cfset c = gettickcount()>
	<cfloop from="1" to="10000" index="I">
		<cfset strb.append(JavaCast("string", "a"))>
	</cfloop>
	<cfset d = gettickcount()>
</cfsilent>
<cfoutput>result: #b-a# - #d-c#<br></cfoutput>

On 06/07/2005 at 6:15:50 PM MDT Stephen Cassady wrote:
7
I've been trying to run the script to pull out those nasty characters that stop Verity (MX7/k2) dead in its tracks. While processing some 1600 text documents, I keep getting a "500 null" error (that's all that shows up on the screen). When running a reduced set (like 30) it works perfectly fine.

Any suggestions on how to move forward and resolve this issue? Thanks.

On 06/07/2005 at 6:23:09 PM MDT Ryan Guill wrote:
8
Stephen, if it works fine with a small set and you are getting a 500 error with a large one, it means the process is taking too long. It's timing out. Chances are your function is being called recursively? You may be hitting some never-ending loop conditions, or you may just need to think some more about performance. You can also increase the timeout in the administrator, but that should be a last-ditch effort, to be avoided if possible.

On 01/09/2006 at 10:55:07 PM MST pete wrote:
9
Thanks for posting the code. Just saved my life in the 11th hour!

Cheers, Pete (aka lad4bear)

On 02/20/2007 at 12:16:47 AM MST felipe wrote:
10
Can you show me how to use this function? I'm new to ColdFusion, and I have a RealBasic app that updates data in a MySQL db. I'm having problems making RB save (á é í); instead I get (á é í) characters. My thought is to use your function to display the right ones (á é í) on the page until I find a way to make RB save the right encoding.

I hope someone can show me how to use this function in my cfquery to replace those characters.

Thanks

(excuse my poor English)




  


