BYR Achieve · Mirror Forum

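How to Upgrade an STL-based Application to Support Unicode
http://dozb.blogchina.com/1655050.html
Translated by: dozb, Nicole
Translators' note: this article applies only to the STL that ships with MSVC; there are problems with STLport.
Original author: Taka Muraoka
Original source: http://www.codeproject.com/vcpp/stl/upgradingstlappstounicode.asp

Introduction

I recently upgraded a fairly large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t-functions and wrapped my string and character constants in the _T() macro, which everybody knows makes the switch to Unicode safe. All I had to do was define UNICODE and _UNICODE, and I prayed that everything would work as I hoped. Boy, was I wrong :((

So I am writing this article as therapy for two weeks of painful work, and in the hope of sparing other people the pain I went through. Sigh...

The basics

In theory, writing code that compiles with either single- or double-byte characters is straightforward. I was going to write a section on it here, but Chris Maunder has already done it. The techniques he describes are widely known and are very helpful for understanding the rest of this article.

Wide file I/O

There are wide versions of the stream classes, and it is easy to define t-style macros (such as tofstream and tstringstream) to manage them. You would use them like this:

    tofstream testFile( "test.txt" ) ;
    testFile << _T("ABC") ;

Now, you would expect this code to produce a 3-byte file when compiled with single-byte characters and a 6-byte file when compiled with double-byte characters. But you would be wrong: you get a 3-byte file in both cases. What on earth is going on?

The cause is a rule in standard C++: when a wide stream writes to a file, it must convert double-byte characters to single-byte ones. In the example above, the wide string L"ABC" (6 bytes long) is converted to a narrow string (3 bytes) before it is written to the file. Worse still, how the conversion is done is implementation-dependent.

I have not been able to find a definitive explanation of why things were done this way. My guess is that files are considered to be streams of (single-byte) characters, and allowing things to be written two bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you cannot write binary data to a wofstream, because the class tries to narrow it before output.

This was an obvious problem for me, because I have a lot of functions written like this:

    void outputStuff( tostream& os )
    {
        // output stuff to the stream
        os << ....
    }

If you pass in a tstringstream object there is no problem (i.e. it streams out wide characters), but if you pass in a tofstream you get weird results (because everything gets narrowed).

Wide file I/O: the solution

Stepping through the STL in the debugger showed that wofstream calls a std::codecvt object to narrow the output data just before writing it to the file. std::codecvt objects are responsible for converting strings from one character set to another. The C++ standard requires two of them to be provided: one that converts chars to chars (i.e. effectively does nothing), and one that converts wchar_ts to chars. The latter is the one that caused us so much grief.

The solution: write a new class derived from codecvt that converts wchar_ts to wchar_ts (i.e. does nothing) and attach it to the wofstream object. When the wofstream tries to convert the data it is writing, it will call our new codecvt object, which does nothing, and the data is written out unchanged.

Browsing Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC), but I had problems compiling it with STLport 4.5.3. This is the version I finally settled on:

    #include <locale>

    // nb: MSVC6+Stlport can't handle "std::"
    // appearing in the NullCodecvtBase typedef.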
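    using std::codecvt ;
    typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

    // A codecvt facet that performs no conversion at all.
    class NullCodecvt
        : public NullCodecvtBase
    {
    public:
        typedef wchar_t _E ;
        typedef char _To ;
        typedef mbstate_t _St ;

        explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

    protected:
        virtual result do_in( _St& _State ,
                       const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                       _E* _F2 , _E* _L2 , _E*& _Mid2
                       ) const
        { return noconv ; }
        // nb: the output buffer pointers must all be _To* for this to
        // override the base class function.
        virtual result do_out( _St& _State ,
                       const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                       _To* _F2 , _To* _L2 , _To*& _Mid2
                       ) const
        { return noconv ; }
        virtual result do_unshift( _St& _State ,
                _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
        { return noconv ; }
        virtual int do_length( _St& _State , const _To* _F1 ,
               const _To* _L1 , size_t _N2 ) const _THROW0()
        { return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ; }
        virtual bool do_always_noconv() const _THROW0()
        { return true ; }
        virtual int do_max_length() const _THROW0()
        { return 2 ; }
        virtual int do_encoding() const _THROW0()
        { return 2 ; }
    } ;

As you can see, these functions are all stubs that do nothing except return noconv. All that remains is to instantiate the class and attach it to the wofstream object. With MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue an object with a locale. It would not work with my new NullCodecvt class, however, so I bypassed the macro and wrote a new one to replace it:

    #define IMBUE_NULL_CODECVT( outputFile ) \
    { \
        NullCodecvt* pNullCodecvt = new NullCodecvt ; \
        locale loc = locale::classic() ; \
        loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
        (outputFile).imbue( loc ) ; \
    }

So the example code given above, which did not work properly, can now be written like this:

    tofstream testFile ;
    IMBUE_NULL_CODECVT( testFile ) ;
    testFile.open( "test.txt" , ios::out | ios::binary ) ;
    testFile << _T("ABC") ;

It is important that the file stream object is imbued with the new codecvt object before the file is opened. The file must also be opened in binary mode. Otherwise, every time the file sees a wide character whose high or low byte is 10, it will perform the usual CR/LF translation, which is not what you want. If you really do want a CR/LF sequence, insert "\r\n" explicitly instead of using std::endl.

The wchar_t problem

wchar_t is the type used for wide characters, and it is defined like this:

    typedef unsigned short wchar_t ;

Unfortunately, because it is a typedef rather than a real C++ type, this definition has one nasty drawback: you cannot overload on it. Consider the following code:

    TCHAR ch = _T('A') ;
    tcout << ch << endl ;

With narrow characters it does what you expect: it prints the letter A. With wide characters it prints 65. The compiler decides that you are streaming out an unsigned short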
and prints it as a numeric value rather than as a wide character. Argh!!! There is no way around this other than to go through your entire code base, find the places where you stream out individual characters, and fix them. I wrote a small function to make things a little better:

    #ifdef _UNICODE
        // NOTE: Can't stream out wchar_t's - convert to a string first!
        inline std::wstring toStreamTchar( wchar_t ch )
            { return std::wstring(&ch,1) ; }
    #else
        // NOTE: It's safe to stream out narrow char's directly.
        inline char toStreamTchar( char ch ) { return ch ; }
    #endif // _UNICODE

    TCHAR ch = _T('A') ;
    tcout << toStreamTchar(ch) << endl ;

Wide exception classes

Most C++ programs use exceptions to handle errors. Unfortunately, std::exception is defined like this:

    class std::exception
    {
        // ...
        virtual const char *what() const throw() ;
    } ;

so it can only carry narrow error messages. I only ever throw exceptions of my own devising or std::runtime_error, so I wrote a wide version of std::runtime_error as follows:

    class wruntime_error
        : public std::runtime_error
    {

    public:                 // --- PUBLIC INTERFACE ---

    // constructors:
                            wruntime_error( const std::wstring& errorMsg ) ;
    // copy/assignment:
                            wruntime_error( const wruntime_error& rhs ) ;
        wruntime_error&     operator=( const wruntime_error& rhs ) ;
    // destructor:
        virtual             ~wruntime_error() ;

    // exception methods:
        const std::wstring& errorMsg() const ;

    private:                // --- DATA MEMBERS ---

    // data members:
        std::wstring        mErrorMsg ; ///< Exception error message.

    } ;

    #ifdef _UNICODE
        #define truntime_error wruntime_error
    #else
        #define truntime_error runtime_error
    #endif // _UNICODE

    /* -------------------------------------------------------------------- */

    wruntime_error::wruntime_error( const wstring& errorMsg )
        : runtime_error( toNarrowString(errorMsg) )
        , mErrorMsg(errorMsg)
    {
        // NOTE: We give the runtime_error base the narrow version of the
        //  error message. This is what will get shown if what() is called.
        //  The wruntime_error inserter or errorMsg() should be used to get
        //  the wide version.
    }

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    wruntime_error::wruntime_error( const wruntime_error& rhs )
        : runtime_error( toNarrowString(rhs.errorMsg()) )
        , mErrorMsg(rhs.errorMsg())
    {
    }

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    wruntime_error&
    wruntime_error::operator=( const wruntime_error& rhs )
    {
        // copy the wruntime_error
        runtime_error::operator=( rhs ) ;
        mErrorMsg = rhs.mErrorMsg ;

        return *this ;
    }

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    wruntime_error::~wruntime_error()
    {
    }

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

(toNarrowString() is a small function that converts a wide string to a narrow one; it is given below.) wruntime_error simply keeps its own copy of the wide error message and hands a narrow version to the std::exception base in case somebody calls what(). My own exception classes are then defined like this:

    class MyExceptionClass : public std::truntime_error
    {
    public:
        MyExceptionClass( const std::tstring& errorMsg )
            : std::truntime_error(errorMsg) { }
    } ;

The final problem was that I had a lot of code that looked like this:

    try
    {
        // do something...
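    }
    catch( exception& xcptn )
    {
        tstringstream buf ;
        buf << _T("An error has occurred: ") << xcptn ;
        AfxMessageBox( buf.str().c_str() ) ;
    }

I had defined an inserter for std::exception like this:

    tostream&
    operator<<( tostream& os , const exception& xcptn )
    {
        // insert the exception
        // NOTE: toTstring() converts a string to a tstring - defined below
        os << toTstring( xcptn.what() ) ;

        return os ;
    }

The problem is that my inserter calls what(), which only returns the narrow version of the error message. But if the error message contains foreign characters, I want to see them in the error dialog. So I rewrote the inserter like this:

    tostream&
    operator<<( tostream& os , const exception& xcptn )
    {
        // insert the exception
        if ( const wruntime_error* p =
                dynamic_cast<const wruntime_error*>(&xcptn) )
            os << p->errorMsg() ;
        else
            os << toTstring( xcptn.what() ) ;

        return os ;
    }

Now it checks whether it has been given a wide exception class and, if so, streams out the wide error message. Otherwise it falls back to the standard narrow error message. Even though I may use only truntime_error-derived classes in my application, the latter case is still important, because the STL or other third-party libraries may throw errors derived from std::exception.

Other miscellaneous problems

Q100639: If you use Unicode with MFC, you need to specify wWinMainCRTStartup as your entry point (on the Link page of your Project Options).

Many Windows functions accept a buffer in which to return their results. The buffer size is usually specified in characters, not bytes. So the following code works fine when compiled with single-byte characters:

    // get our EXE name
    TCHAR buf[ _MAX_PATH+1 ] ;
    GetModuleFileName( NULL , buf , sizeof(buf) ) ;

but goes wrong with double-byte characters, because sizeof(buf) is then twice the number of characters the buffer can hold. The call to GetModuleFileName() needs to be written like this:

    GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;

If you process a file one character at a time, you need to test for WEOF rather than EOF.

HttpSendRequest() accepts a string that specifies additional headers to attach to the HTTP request before it is sent. ANSI builds accept a length of -1 to mean that the header string is NULL-terminated. Unicode builds require the length of the string to be given explicitly. Don't ask me why.

Miscellaneous useful stuff

Finally, if you are doing similar work, here are a few small functions that you may find useful:

    extern std::wstring toWideString( const char* pStr , int len=-1 ) ;
    inline std::wstring toWideString( const std::string& str )
        { return toWideString(str.c_str(),str.length()) ; }
    inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
        { return (len < 0) ?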
            pStr : std::wstring(pStr,len) ; }
    inline std::wstring toWideString( const std::wstring& str )
        { return str ; }

    extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ;
    inline std::string toNarrowString( const std::wstring& str )
        { return toNarrowString(str.c_str(),str.length()) ; }
    inline std::string toNarrowString( const char* pStr , int len=-1 )
        { return (len < 0) ? pStr : std::string(pStr,len) ; }
    inline std::string toNarrowString( const std::string& str )
        { return str ; }

    #ifdef _UNICODE
        // Unicode build: TCHAR is wchar_t, tstring is wstring.
        inline TCHAR toTchar( char ch )
            { return (wchar_t)ch ; }
        inline TCHAR toTchar( wchar_t ch )
            { return ch ; }
        inline std::tstring toTstring( const std::string& s )
            { return toWideString(s) ; }
        inline std::tstring toTstring( const char* p , int len=-1 )
            { return toWideString(p,len) ; }
        inline std::tstring toTstring( const std::wstring& s )
            { return s ; }
        inline std::tstring toTstring( const wchar_t* p , int len=-1 )
            { return (len < 0) ? p : std::wstring(p,len) ; }
    #else
        // ANSI build: TCHAR is char, tstring is string.
        inline TCHAR toTchar( char ch )
            { return ch ; }
        // Wide characters that do not fit in one byte become '?'.
        inline TCHAR toTchar( wchar_t ch )
            { return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ; }
        inline std::tstring toTstring( const std::string& s )
            { return s ; }
        inline std::tstring toTstring( const char* p , int len=-1 )
            { return (len < 0) ?
                p : std::string(p,len) ; }
        inline std::tstring toTstring( const std::wstring& s )
            { return toNarrowString(s) ; }
        inline std::tstring toTstring( const wchar_t* p , int len=-1 )
            { return toNarrowString(p,len) ; }
    #endif // _UNICODE

    /* -------------------------------------------------------------------- */

    wstring
    toWideString( const char* pStr , int len )
    {
        ASSERT_PTR( pStr ) ;
        ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ;

        // figure out how many wide characters we are going to get
        // nb: with len == -1 the count includes the terminating NULL
        int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ;
        if ( len == -1 )
            -- nChars ;
        // nb: also covers a failed call (which returns 0)
        if ( nChars <= 0 )
            return L"" ;

        // convert the narrow string to a wide string
        // nb: slightly naughty to write directly into the string like this
        wstring buf ;
        buf.resize( nChars ) ;
        MultiByteToWideChar( CP_ACP , 0 , pStr , len ,
            const_cast<wchar_t*>(buf.c_str()) , nChars ) ;

        return buf ;
    }

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

    string
    toNarrowString( const wchar_t* pStr , int len )
    {
        ASSERT_PTR( pStr ) ;
        ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ;

        // figure out how many narrow characters we are going to get
        // nb: with len == -1 the count includes the terminating NULL
        int nChars = WideCharToMultiByte( CP_ACP , 0 ,
                 pStr , len , NULL , 0 , NULL , NULL ) ;
        if ( len == -1 )
            -- nChars ;
        // nb: also covers a failed call (which returns 0)
        if ( nChars <= 0 )
            return "" ;

        // convert the wide string to a narrow string
        // nb: slightly naughty to write directly into the string like this
        string buf ;
        buf.resize( nChars ) ;
        WideCharToMultiByte( CP_ACP , 0 , pStr , len ,
              const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ;

        return buf ;
    }
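
The wide exception class and its inserter can also be sketched in portable standard C++ (C++11), which may help when trying the idea outside MSVC. This is only a minimal sketch under stated assumptions, not the article's code: WideRuntimeError and narrowLossy() are hypothetical names, and narrowLossy() is a crude stand-in for toNarrowString() that keeps ASCII and replaces everything else with '?' instead of calling WideCharToMultiByte().

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for toNarrowString(): keeps ASCII characters and
// replaces everything else with '?'. No code page conversion is attempted.
inline std::string narrowLossy(const std::wstring& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t ch : wide)
        narrow += (static_cast<unsigned long>(ch) < 0x80) ? static_cast<char>(ch) : '?';
    return narrow;
}

// Hypothetical portable counterpart of wruntime_error: the base class gets
// the narrow message (returned by what()), the wide original is kept here.
class WideRuntimeError : public std::runtime_error {
public:
    explicit WideRuntimeError(const std::wstring& msg)
        : std::runtime_error(narrowLossy(msg)), wideMsg_(msg) {}
    const std::wstring& errorMsg() const noexcept { return wideMsg_; }
private:
    std::wstring wideMsg_;
};

// Wide inserter: prefers the wide message when there is one, otherwise
// widens what() one char at a time (only meaningful for ASCII text).
inline std::wostream& operator<<(std::wostream& os, const std::exception& e) {
    if (const auto* wide = dynamic_cast<const WideRuntimeError*>(&e))
        return os << wide->errorMsg();
    const std::string narrow = e.what();
    return os << std::wstring(narrow.begin(), narrow.end());
}
```

Catching by const std::exception& and streaming the exception into a std::wostringstream then yields the wide message for WideRuntimeError and the widened what() text for any other exception, which mirrors the dynamic_cast fallback described above.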

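For the wide file I/O problem, a different technique from the codecvt facet described in the article is to leave wofstream out altogether and write the raw bytes of the wide string through a narrow stream opened in binary mode. The sketch below is an assumption-laden illustration of that alternative, not the author's method: writeRawWide() and readRawWide() are hypothetical helpers, and the resulting bytes depend on the platform's sizeof(wchar_t) and byte order, so such files are not portable between platforms.

```cpp
#include <cassert>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the in-memory bytes of a wide string to a
// narrow stream without any conversion. A file stream passed here should
// have been opened with std::ios::binary to avoid CR/LF translation.
inline std::ostream& writeRawWide(std::ostream& os, const std::wstring& text) {
    return os.write(reinterpret_cast<const char*>(text.data()),
                    static_cast<std::streamsize>(text.size() * sizeof(wchar_t)));
}

// Hypothetical helper: rebuilds the wide string from bytes produced by
// writeRawWide() on the same platform.
inline std::wstring readRawWide(const std::string& bytes) {
    assert(bytes.size() % sizeof(wchar_t) == 0);
    std::wstring text(bytes.size() / sizeof(wchar_t), L'\0');
    std::memcpy(&text[0], bytes.data(), bytes.size());
    return text;
}
```

Writing L"ABC" this way produces 3 * sizeof(wchar_t) bytes (6 with MSVC's 2-byte wchar_t), which is the size the article originally expected from the wide stream. The cost is that formatted insertion with << is no longer available on the file itself; one option is to format into a wide string stream first and write its contents afterwards.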
[Repost] How to Upgrade an STL-based Application to Support Unicode