[C++]: Convert an MBCS string to UTF-8 in standard C++

Let me tell you how I came to write this article.
The other day one of my colleagues said to me: “I have a problem with my code when a path contains a French accent”…

The code looks something like this:

std::string sTest = "C:\\Temp\\TestAccent\\é.txt";

At first I didn’t react, and in fact this code compiles correctly, but something was bothering me…
It turns out that, by default, Visual Studio saves source files in the Windows ANSI code page of the OS.
So on a French OS (code page Windows-1252) there won’t be any problem, but what happens on a Chinese or Vietnamese OS when the compiler parses our accent, or when an editor displays it? The same source file opened on an Asian OS, for example, will probably not show an é but some other character, which is not very reassuring when you have to investigate an encoding problem.
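One way to take the source file encoding out of the equation is to spell the accent as explicit byte escapes. The sketch below is only an illustration: the two function names are hypothetical, and it assumes you already know which encoding the consumer of the string expects. MSVC's /utf-8 compiler option is another route.

```cpp
#include <string>

// Hypothetical helpers: the same path as above, with the accent written as
// explicit bytes so the result does not depend on how the .cpp file is saved.

// é in Windows-1252 is the single byte 0xE9.
inline std::string accent_path_cp1252()
{
    return "C:\\Temp\\TestAccent\\\xE9.txt";
}

// é in UTF-8 is the two bytes 0xC3 0xA9.
inline std::string accent_path_utf8()
{
    return "C:\\Temp\\TestAccent\\\xC3\xA9.txt";
}
```

The two strings differ by exactly one byte in length, which is a quick way to tell which encoding a given std::string actually holds.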

So my first test was to create the file in question on the file system (NTFS stores file names as UTF-16), then list it with the Windows MBCS API and see how the accent is encoded in memory.

 

#include <windows.h>
#include <string>
#include <iostream>
#include <locale>
#include <codecvt>

int main()
{
   const std::string dir("C:\\Temp\\TestAccent");
   const std::string pattern = dir + "\\*";
   std::string filepath;

   // First get path with MBCS Windows api
   WIN32_FIND_DATAA data;
   HANDLE hFind;
   if ((hFind = FindFirstFileA(pattern.c_str(), &data)) != INVALID_HANDLE_VALUE) {
      do {
         std::string fname = data.cFileName;
         if (fname == "." || fname == "..") continue; // Skip . and ..
         filepath = dir + "\\" + fname; // rebuilt for each entry, not accumulated
         std::cout << filepath << std::endl;

      } while (FindNextFileA(hFind, &data) != 0);
      FindClose(hFind);
   }
   return 0;
}

When we look at the memory, we see that with the Windows MBCS APIs (those whose names end with an A) our accent é is encoded on 1 byte with the value 0xE9.
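If you prefer not to rely on a debugger memory window, a small hex dump makes the same observation visible on the console. This is a minimal sketch; to_hex is a hypothetical helper name, not something from the Windows API or from Boost.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of s as two lowercase hex digits,
// separated by spaces.
std::string to_hex(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling to_hex on a file name returned by FindFirstFileA should then show a single e9 byte for the accent on a Windows-1252 system, whereas a UTF-8 string would show c3 a9.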

Let’s continue our exploration and use the C++ std::filesystem functions (and their Boost equivalents) to see how the file name is encoded:

#include <string>
#include <iostream>

#include <filesystem>
namespace fs = std::filesystem;

#define BOOST_AUTO_LINK_SYSTEM 
#include <boost/filesystem.hpp>
namespace bfs = boost::filesystem;

int main()
{
   std::string str, stru8;

   // c++17, std::filesystem
   std::string path("C:\\Temp\\TestAccent");
   for (const auto& entry: fs::directory_iterator(path))
   {
      str = entry.path().string();
      const char* pStr = str.c_str();
      stru8 = entry.path().u8string(); // C++17: std::string (C++20 returns std::u8string)
      const char* pU8Str = stru8.c_str();
   }

   // boost::filesystem
   for (const auto& entry: bfs::directory_iterator(path))
   {
      str = entry.path().string();
      const char* pStr = str.c_str();
      
      // u8string() does not exist in boost
   }

   return 0;
}

If we examine the memory:

std::filesystem::path::string() returns an MBCS string with é encoded on 1 byte (0xE9).
std::filesystem::path::u8string() returns a UTF-8 string with é encoded on 2 bytes (0xC3 0xA9).
boost::filesystem::path has no u8string() method, and its string() method behaves like the std version, i.e. it returns an MBCS string.
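One detail to keep in mind: u8string() returns std::string in C++17 but std::u8string in C++20, so assigning it directly to a std::string stops compiling when the language level is raised. The sketch below shows one way to stay compatible with both; path_to_utf8 is a hypothetical helper name.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: returns the path as UTF-8 bytes stored in a std::string.
// Copying through iterators compiles whether u8string() yields std::string
// (C++17) or std::u8string (C++20).
std::string path_to_utf8(const fs::path& p)
{
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}
```

Inside the directory_iterator loop above, this would be called as path_to_utf8(entry.path()).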

But let’s come back to my colleague’s initial problem…
She uses an abstraction library on top of boost::filesystem that only works with UTF-8 paths, so passing it an MBCS path causes an exception:

#include <windows.h>
#include <string>
#include <iostream>

int main()
{
   const std::string dir("C:\\Temp\\TestAccent");
   const std::string pattern = dir + "\\*";
   std::string filepath;

   // First get path with MBCS Windows api
   WIN32_FIND_DATAA data;
   HANDLE hFind;
   if ((hFind = FindFirstFileA(pattern.c_str(), &data)) != INVALID_HANDLE_VALUE) {
      do {
         std::string fname = data.cFileName;
         if (fname == "." || fname == "..") continue; // Skip . and ..
         filepath = dir + "\\" + fname; // rebuilt for each entry, not accumulated
         std::cout << filepath << std::endl;

      } while (FindNextFileA(hFind, &data) != 0);
      FindClose(hFind);
   }

   // this method only understands UTF-8 paths
   // and here filepath holds an MBCS string (C:\Temp\TestAccent\é.txt) => !!!!! EXCEPTION !!!!!
   if (Hal::FileUtils::Exists(filepath))
   {
      //Do something
   }

   return 0;
}

So it seems simple: just convert the MBCS string (encoded with the OS code page) into UTF-8.
I started looking at how to do this in standard C++, i.e. without calling Windows functions directly, and after some research I turned to boost::locale::conv with the following code:

std::string utf8_string = boost::locale::conv::to_utf<char>(filepath, "HowCanIKnowWhatToPutHere");

The first problem is knowing what to pass as the encoding name. Intuitively I know it must be something like 1252, but what exactly?
The easiest way is to look in the Boost sources; by debugging we come across the following table, located in the file src/encoding/wconv_codepage.ipp:

windows_encoding all_windows_encodings[] = {
        { "big5",       950, 0 },
        { "cp1250",     1250, 0 },
        { "cp1251",     1251, 0 },
        { "cp1252",     1252, 0 },
        { "cp1253",     1253, 0 },
        { "cp1254",     1254, 0 },
        { "cp1255",     1255, 0 },
        { "cp1256",     1256, 0 },
        { "cp1257",     1257, 0 },
        { "cp874",      874, 0 },
        { "cp932",      932, 0 },
        { "cp936",      936, 0 },
        { "eucjp",      20932, 0 },
        { "euckr",      51949, 0 },
        { "gb18030",    54936, 0 },
        { "gb2312",     20936, 0 },
        { "gbk",        936, 0 },
        { "iso2022jp",  50220, 0 },
        { "iso2022kr",  50225, 0 },
        { "iso88591",   28591, 0 },
        { "iso885913",  28603, 0 },
        { "iso885915",  28605, 0 },
        { "iso88592",   28592, 0 },
        { "iso88593",   28593, 0 },
        { "iso88594",   28594, 0 },
        { "iso88595",   28595, 0 },
        { "iso88596",   28596, 0 },
        { "iso88597",   28597, 0 },
        { "iso88598",   28598, 0 },
        { "iso88599",   28599, 0 },
        { "koi8r",      20866, 0 },
        { "koi8u",      21866, 0 },
        { "ms936",      936, 0 },
        { "shiftjis",   932, 0 },
        { "sjis",       932, 0 },
        { "usascii",    20127, 0 },
        { "utf8",       65001, 0 },
        { "windows1250",        1250, 0 },
        { "windows1251",        1251, 0 },
        { "windows1252",        1252, 0 },
        { "windows1253",        1253, 0 },
        { "windows1254",        1254, 0 },
        { "windows1255",        1255, 0 },
        { "windows1256",        1256, 0 },
        { "windows1257",        1257, 0 },
        { "windows874",         874, 0 },
        { "windows932",         932, 0 },
        { "windows936",         936, 0 },
};

First disappointment: this table is hard-coded. On the other hand, we see that I can pass either cp1252 or windows1252.
OK, I test it and it works, but I can’t leave a hard-coded string like this, because it will work on a French Windows but not necessarily elsewhere.
Second disappointment: I realize that it is not possible to simply convert an MBCS string, because Microsoft does not provide an API that maps the current code page to an encoding name. To be exact, there are the GetACP() and GetLocaleInfo() APIs, which only give the code page number (as an integer or as a string of digits):

   char strCodePage[10];
   UINT codePage = 0;

   // The A version fills a char buffer, so the size is counted in chars (not TCHARs)
   if (GetLocaleInfoA(GetSystemDefaultLCID(), LOCALE_IDEFAULTANSICODEPAGE,
      strCodePage, static_cast<int>(sizeof(strCodePage))) > 0)
   {
      // ANSI code page id
      // strCodePage = "1252" on a French OS
      codePage = static_cast<UINT>(atoi(strCodePage));
   }

GetLocaleInfo returns a string, but it is only the number (“1252”), not one of the names Boost uses in its table (“cp1252” or “windows1252”)…
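A partial workaround can be imagined by building the name from the number, since several entries in the table above are just “cp” followed by the code page. The sketch below is only that, a sketch: boost_name_from_codepage is a hypothetical helper, not part of Boost, and it can only help for code pages that actually appear in the table. It still needs GetACP(), so it is not pure standard C++ either.

```cpp
#include <string>

// Hypothetical helper (not part of Boost): builds an encoding name such as
// "cp1252" from a numeric Windows code page.
// Assumption: the result is only useful when the name appears in the table
// shown above (for example 1250 to 1257, 874, 932, 936). Code pages 950 and
// 65001 are listed there under the names "big5" and "utf8".
std::string boost_name_from_codepage(unsigned int codePage)
{
    if (codePage == 65001) return "utf8";
    if (codePage == 950) return "big5";
    return "cp" + std::to_string(codePage);
}

// On Windows the call could look like this (GetACP() comes from <windows.h>):
//    std::string name = boost_name_from_codepage(::GetACP());
//    std::string utf8 = boost::locale::conv::to_utf<char>(filepath, name);
```

A code page that is missing from the table would still be rejected, so this does not remove the need for a fallback.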

In a perfect world, Boost would add an entry to its table (let’s call it “mbcs” or “CP_ACP”) and, when given this name, would call GetACP() to retrieve the code page to use for the conversion. This is what the code would look like:

#include <windows.h>
#include <string>
#include <iostream>

#include <boost/locale.hpp>

//========================================================================================
// BIG WARNING: CODE BELOW DOES NOT WORK AND ONLY SHOWS HOW IT SHOULD BE IN A PERFECT WORLD
//========================================================================================
std::string mbcs_to_utf8(const std::string& sMbcs)
{
   // I REPEAT: THE CODE BELOW DOES NOT WORK
   return boost::locale::conv::to_utf<char>(sMbcs, "CP_ACP");
   // I REPEAT: THE CODE ABOVE DOES NOT WORK
}

Conclusion of this article: I could be wrong, but it seems that it is not possible to simply convert an MBCS string to UTF-8 with standard C++ alone.
So the solution is to first convert to UTF-16 using a Windows API, then convert to UTF-8, this time using pure C++:

#define _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS
#include <windows.h>
#include <string>
#include <locale>
#include <codecvt>

/////////////////////////////////////////////////////////////////////////
// Function below converts a mbcs string into utf8
/////////////////////////////////////////////////////////////////////////

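std::string mbcs_to_utf8(const std::string& sMbcs)
{
   if (sMbcs.empty()) return std::string();

   // Convert the MBCS string (system ANSI code page) to a wide string (UTF-16).
   // Passing the explicit length keeps the terminating '\0' out of the result.
   const int len = static_cast<int>(sMbcs.size());
   const int count = ::MultiByteToWideChar(CP_ACP, 0, sMbcs.data(), len, nullptr, 0);
   if (count <= 0) return std::string(); // conversion failed
   std::wstring ws(count, L'\0');
   ::MultiByteToWideChar(CP_ACP, 0, sMbcs.data(), len, &ws[0], count);

   // Convert the UTF-16 wstring to UTF-8 (codecvt_utf8_utf16 handles surrogate pairs)
   std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
   return myconv.to_bytes(ws);
}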
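Since std::wstring_convert and the <codecvt> facets are deprecated as of C++17 (hence the _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS define), the second step can also be written by hand. The following is a minimal sketch of a UTF-16 to UTF-8 encoder in plain standard C++; utf16_to_utf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a design choice made here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

On Windows, where wchar_t holds UTF-16 code units, the std::wstring filled by MultiByteToWideChar could be passed as std::u16string(ws.begin(), ws.end()).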

This article makes me think that it is dangerous to mix APIs that use UTF-8 strings with others that use MBCS strings, because we quickly end up with incompatibility problems and crashes.
If I have the courage, this will be the subject of another article entitled: Why Using UTF8 inside a Windows Program is Dangerous Until Microsoft Standardizes APIs.
Under Windows, when using std::string, one should reserve UTF-8 for exchanging data between the binary and the outside world.
As a little teaser: since C++20 introduced the char8_t type, Microsoft could add to its C runtime library overloads taking const char8_t*:

// TO BE ABLE TO USE UTF8 ON WINDOWS, MICROSOFT
// COULD ADD libc functions taking char8_t
// AND ON LINUX/MACOS char8_t would be an alias to char
FILE* fopen(char8_t const* _FileName, char8_t const* _Mode);
int printf(char8_t const* const _Format, ...);