[ID3 Dev] Unicode

Sun Feb 11 19:36:58 PST 2007

Dale, I will draw on my tiny and recent study of Unicode:

You're right, it can be either big- or little-endian, so to be sure
that it is decoded correctly, a byte order mark (BOM) can be
prepended. This is the two bytes 0xFE 0xFF for big-endian, or 0xFF
0xFE for little-endian.
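
To make the BOM handling concrete, here is a minimal Python sketch
(the helper name decode_utf16_with_bom is my own, not from any
tagging library). It picks the byte order from the first two bytes,
and also shows what happens when the byte order is read the wrong
way round.

```python
import codecs

def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to choose the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 byte order mark found")

big = codecs.BOM_UTF16_BE + "Erét".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Erét".encode("utf-16-le")

# Reading big-endian bytes as little-endian swaps the two bytes of every
# code unit: "E" (U+0045) turns into U+4500, which is a CJK ideograph.
swapped = "Erét".encode("utf-16-be").decode("utf-16-le")
```

Both big and little decode back to the same string; swapped is the
kind of output you would expect to see rendered in a CJK font.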

That said, there is UTF-16BE, which is always big-endian, and
UTF-16LE, which is always, you guessed it, little-endian. They don't
need a BOM, and in fact forbid it.
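
As a quick illustration, Python's standard codecs behave this way:
the explicit "utf-16-be" and "utf-16-le" codecs write no BOM, while
the plain "utf-16" codec prepends one. This is only a sketch using
Python's built-in codecs; other languages and libraries may differ.

```python
import codecs

be = "E".encode("utf-16-be")     # high byte first, no BOM
le = "E".encode("utf-16-le")     # low byte first, no BOM
generic = "E".encode("utf-16")   # BOM first, then the platform's byte order

# The generic codec always starts with one of the two BOMs.
has_bom = generic[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```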

The ID3v2.3 spec only allows for ISO-8859-1 (the encoding byte set to
0x00) and UCS-2, which is essentially UTF-16 (encoding byte set to
0x01). The spec requires Unicode strings to begin with the BOM, so we
need to add it when writing them.
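
Here is a rough sketch of how the text part of a frame could be
written under those rules. The function name encode_text_body and the
choice of little-endian output are my assumptions; it builds only the
encoding byte plus the string, not the frame header or any
terminator.

```python
import codecs

ENC_LATIN1 = 0x00   # ISO-8859-1
ENC_UNICODE = 0x01  # 16-bit Unicode, must start with a BOM

def encode_text_body(text: str) -> bytes:
    """Return the encoding byte followed by the encoded string."""
    try:
        # Prefer ISO-8859-1 when every character fits in it.
        return bytes([ENC_LATIN1]) + text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Otherwise write BOM + UTF-16LE. Note: characters above U+FFFF
        # become surrogate pairs, which strict UCS-2 does not have.
        return (bytes([ENC_UNICODE]) + codecs.BOM_UTF16_LE
                + text.encode("utf-16-le"))

latin = encode_text_body("Erét")      # é exists in ISO-8859-1
wide = encode_text_body("\u0416")     # Cyrillic Zhe does not
```

Trying ISO-8859-1 first keeps tags readable by older software and
only falls back to Unicode when it is really needed.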

There is also UTF-8, which is a sequence of single bytes whose order
within a multi-byte character is fixed by the encoding, so it has no
big- or little-endian variants and does not need a BOM.
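
A small Python sketch of that point, and of the mislabelling case
raised earlier in the thread (UTF-8 bytes written under the 0x01
"Unicode" encoding byte). The variable names are only for
illustration.

```python
# UTF-8 produces the same byte sequence on every platform.
e_acute = "é".encode("utf-8")        # two bytes: 0xC3 0xA9
album = "Erét".encode("utf-8")       # five bytes in total

# If a reader treats those bytes as UTF-16LE, the first two bytes
# "E" (0x45) and "r" (0x72) are combined into one code unit, 0x7245.
misread = album[:2].decode("utf-16-le")

# U+4E00..U+9FFF is the CJK Unified Ideographs block.
is_cjk = 0x4E00 <= ord(misread) <= 0x9FFF
```

So plain Latin text stored as UTF-8 but labelled as 16-bit Unicode
would be expected to show up as CJK characters.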

Wikipedia explains in a great deal more detail.

Best,

Mark

On 12 Feb 2007, at 03:13, Dale Preston wrote:

> Can't Unicode use either little-endian or big-endian?
>
> This is an area where I have no experience to draw on, and it has
> definitely been a struggle in my library.
>
> Dale
>
>
> -----Original Message-----
> From: Jud White [mailto:jwhite at cdtag.com]
> Sent: Sunday, February 11, 2007 8:32 PM
> To: id3v2 at id3.org
> Subject: Re: [ID3 Dev] Unicode
>
> Just tested this.. if you're writing the BOM reversed (should be 0xFF
> 0xFE) you'll get oriental characters in iTunes.
>
> Jud White wrote:
>> Mark,
>>
>> This isn't a UCS-2 vs UTF-16 issue.  The differences in these two
>> only occur over 0xffff.  Also it's not an issue with BOM since
>> iTunes can cope without BOM.
>>
>> I was able to reproduce this behavior by writing a text encoding byte
>> of "Unicode" (0x01) but writing the actual string in UTF-8.  Maybe
>> your implementation is doing something similar?
>>
>> -Jud
>>
>>
>>
>> Mark Smith wrote:
>>> I'm getting a bit exasperated with trying to handle Unicode
>>> correctly. In my library, I'm handling all strings as UTF-8
>>> internally, but since the 2.3 spec (as I've understood it) only
>>> allows for ISO 8859-1 and UCS-2 (for the moment I'm treating
>>> UCS-2 as if it were UTF-16), I'm writing out as UTF-16, where
>>> necessary.
>>>
>>> What I'm finding is that if I write out a TALB frame as "Erét"
>>> (that's E - r - e with acute accent - t, if your mail client
>>> displays something else) as UTF-16, iTunes and the other two
>>> tagging apps I've checked out display it in an oriental font.
>>>
>>> So the question is, am I wrong, or are other people just not
>>> bothering to deal with anything but English?
>>>
>>> Any insights gratefully received...
>>>
>>> Thanks,
>>>
>>> Mark
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: id3v2-unsubscribe at id3.org
>>> For additional commands, e-mail: id3v2-help at id3.org
