So we want to be able to sign some UTF8 encoded data for approvals in OpenXPKI. The data is contained in a form field and should be passed on to a SignedData object from CAPICOM. Unluckily, just doing SignedData.Content = theForm.text.value does not work. Apparently (did I mention that I barely know VBScript), everything is UTF-16LE internally, so even a simple SignedData.Content = "test" won't do and will sign a »t« (our guess is that CAPICOM assumes that what it gets is 0-terminated, so it stops reading after the UTF-16LE encoded »t«).
Some more or less extensive googling showed that one can go through the string using Mid() (VBScript's weird name for substr) and get the Unicode codepoint using AscW. The first idea was to encode these values in hex, add a delimiter and sign just that. But of course that's not really readable and not really space-efficient, too. Someone suggested that converting the whole string to UTF8 might be feasible from here — and offered the additional benefit of not having to code different validation methods on the server for different browsers, as Mozilla's crypto.signText() method already signs UTF8. I have to admit that although I've used it quite a bit, I've never looked at how UTF8 works. Luckily, it is actually quite easy (just some bitshifting). Some time later, the UnicodeToUTF8 method was done. It looked quite similar to this:
Function UnicodeToUTF8(ByRef pstrUnicode) ' Written 2007 by Alexander Klink for the OpenXPKI Project ' (c) 2007 by the OpenXPKI Project, released under the Apache License v2.0 ' converts a unicode string to UTF8 ' reference: http://en.wikipedia.org/wiki/UTF8 Dim i, result result = "" For i = 1 To Len(pStrUnicode) CurrentChar = Mid(PstrUnicode, i, 1) CodePoint = AscW(CurrentChar) MaskSixBits = 2^6 - 1 ' the lower 6 bits are 1 MaskFourBits = 2^4 - 1 ' the lower 4 bits are 1 MaskThreeBits = 2^3 - 1 ' the lower 3 bits are 1 MaskTwoBits = 2^2 - 1 ' the lower 3 bits are 1 'MsgBox CurrentChar & " : " & CodePoint If (CodePoint >= 0) And (CodePoint < 128) Then ' for codepoints < 128, just add one byte with the ' value of the codepoint (this is the ASCII subset) Zs = CodePoint result = result & ChrB(Zs) End If ' this is common for all of the following Zs = CodePoint And MaskSixBits If (CodePoint >= 128) And (CodePoint < 2048) Then ' for naming, see the Wikipedia article referenced above Ys = RightShift(CodePoint, 6) FirstByte = LeftShift(6, 5) Xor Ys ' 110yyyy SecondByte = LeftShift(2, 6) Xor Zs ' 10zzzzz 'MsgBox "Case 1: " & FirstByte & ", " & SecondByte result = result & ChrB(FirstByte) & ChrB(SecondByte) End If If (CodePoint >= 2048) And (CodePoint < 65536) Then Ys = RightShift(CodePoint, 6) And MaskSixBits Xs = RightShift(CodePoint, 12) And MaskFourBits FirstByte = LeftShift(14, 4) Xor Xs ' 1110xxxx SecondByte = LeftShift(2, 6) Xor Ys ' 10yyyyyy ThirdByte = LeftShift(2, 6) Xor Zs ' 10zzzzzz 'MsgBox "Case 2: " & FirstByte & ", " & SecondByte & ", " & ThirdByte result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte) End If If (CodePoint >= 65536) And (CodePoint < 1114112) Then Ws = RightShift(CodePoint, 18) And MaskThreeBits Xs = RightShift(CodePoint, 12) And MaskSixBits Ys = RightShift(CodePoint, 6) And MaskSixBits FirstByte = LeftShift(30, 3) Xor Ws ' 11110www SecondByte = LeftShift(2, 6) Xor Xs ' 10xxxxxx ThirdByte = LeftShift(2, 6) Xor Ys ' 10yyyyyy FourthByte = LeftShift(2, 6) Xor Zs ' 10zzzzzz 'MsgBox "Case 3: " & FirstByte & ", " & SecondByte & ", " & ThirdByte & FourthByte result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte) & ChrB(FourthByte) End If Next UnicodeToUTF8 = result End Function
Note that VBScript does not have left shifts or right shifts, so I had to implement these, too. Bummer. Not that it is particularly difficult, but what language does not have builtin shift operators?
Anyways, testing showed that it worked fine with both the german and the russian I18N data that we are using. But of course I wanted to be sure that it worked in the more obscure cases, too. I copied some four-byte UTF8 characters from fileformat.info and tried them out. Weirdly enough, that did not seem to work. Of course, at first I thought that it was my fault, but some debugging showed that AscW returned negative numbers for the corresponding characters. Negative numbers? What? Yes, AscW returns an Integer, which ranges from -32768 to +32767. So how can they accomodate Unicode codepoints, which go up to 1114111? Well, they can't. Microsoft even noticed that this might be a problem and published a Knowledgebase article about it. But did they really get the problem? Of course not, as they only talk about Unicode characters from 32768 to 65535 (I really like the quote »Unicode numbers occupy a 16-bit positive range from 0 to 65535 (0xFFFF)« – yeah, sure), for which AscW actually returns the one's complement of the number, so adding 65536 works. Anyhow, you're screwed with everything that is above 65535 (not that one reason for Unicode is to get rid of all the two-byte crap).
Here is the new version of UnicodeToUTF8, which probably should be called UnicodeToUTF8SortOf:
Function UnicodeToUTF8(ByRef pstrUnicode) ' Written 2007 by Alexander Klink for the OpenXPKI Project ' (c) 2007 by the OpenXPKI Project, released under the Apache License v2.0 ' converts a unicode string to UTF8 (well, sort of) ' reference: http://en.wikipedia.org/wiki/UTF8 Dim i, result result = "" For i = 1 To Len(pStrUnicode) CurrentChar = Mid(PstrUnicode, i, 1) CodePoint = AscW(CurrentChar) If (CodePoint < 0) Then ' AscW is broken. Badly. It can only return an integer, ' which is 32767 at most. So everything up to 65535 is ' AscW() + 65536. That Unicode chars exist beyond 65535 ' is apparently unknown to Microsoft ... CodePoint = CodePoint + 65536 End If MaskSixBits = 2^6 - 1 ' the lower 6 bits are 1 MaskFourBits = 2^4 - 1 ' the lower 4 bits are 1 MaskThreeBits = 2^3 - 1 ' the lower 3 bits are 1 MaskTwoBits = 2^2 - 1 ' the lower 3 bits are 1 'MsgBox CurrentChar & " : " & CodePoint If (CodePoint >= 0) And (CodePoint < 128) Then ' for codepoints < 128, just add one byte with the ' value of the codepoint (this is the ASCII subset) Zs = CodePoint result = result & ChrB(Zs) End If ' this is common for all of the following Zs = CodePoint And MaskSixBits If (CodePoint >= 128) And (CodePoint < 2048) Then ' for naming, see the Wikipedia article referenced above Ys = RightShift(CodePoint, 6) FirstByte = LeftShift(6, 5) Xor Ys ' 110yyyy SecondByte = LeftShift(2, 6) Xor Zs ' 10zzzzz 'MsgBox "Case 1: " & FirstByte & ", " & SecondByte result = result & ChrB(FirstByte) & ChrB(SecondByte) End If If (CodePoint >= 2048) And (CodePoint < 65536) Then Ys = RightShift(CodePoint, 6) And MaskSixBits Xs = RightShift(CodePoint, 12) And MaskFourBits FirstByte = LeftShift(14, 4) Xor Xs ' 1110xxxx SecondByte = LeftShift(2, 6) Xor Ys ' 10yyyyyy ThirdByte = LeftShift(2, 6) Xor Zs ' 10zzzzzz 'MsgBox "Case 2: " & FirstByte & ", " & SecondByte & ", " & ThirdByte result = result & ChrB(FirstByte) & ChrB(SecondByte) & ChrB(ThirdByte) End If Next UnicodeToUTF8 = result End Function
I still hope that in spite of all the ranting, the above will be useful to someone.