View Issue Details

ID: 0004647
Project: ardour
Category: bugs
View Status: public
Last Update: 2012-02-13 20:02
Reporter: anrug
Assigned To: anrug
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Target Version: 3.0-beta3
Summary: 0004647: TOC export does not properly encode CD text fields
Description: Ardour (both 3.x and 2.x) writes CD text fields in UTF-8 (this might depend on my system's settings) when exporting cdrdao TOC files. But cdrdao expects Latin1 (aka ISO 8859-1) with double quotes escaped and any non-ASCII-printable characters in octal representation ("\123"). For Asian languages cdrdao (and CD text) uses yet another notation. If someone is interested in getting that working with Ardour, I would need some help from people familiar with Asian language encodings.

Ardour should do the necessary encoding/escaping and maybe warn the user when a character he used can't be encoded as CD text.
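As a worked example of the encoding described above, here is a minimal sketch in C. The helper names `latin1_from_utf8_pair` and `toc_octal_escape` are hypothetical and are not part of Ardour or cdrdao. The character 'Ü' (U+00DC) is stored in UTF-8 as the bytes 0xC3 0x9C, becomes the single byte 0xDC in Latin1, and would then be written to the TOC file as \334.

```c
#include <stdio.h>

/* Hypothetical helper: combine a two-byte UTF-8 sequence whose lead byte is
   0xC2 or 0xC3 into the single ISO 8859-1 byte it represents.
   Returns -1 if the pair is not such a sequence. */
int latin1_from_utf8_pair(unsigned char lead, unsigned char cont)
{
    if ((lead != 0xC2 && lead != 0xC3) || (cont & 0xC0) != 0x80)
        return -1;
    return ((lead & 0x03) << 6) | (cont & 0x3F);
}

/* Hypothetical helper: write the three-digit octal escape for one byte,
   e.g. 0xDC -> "\334". out must hold at least 5 bytes. */
void toc_octal_escape(unsigned char byte, char out[5])
{
    sprintf(out, "\\%03o", (unsigned int)byte);
}
```

Only lead bytes 0xC2 and 0xC3 are accepted because they cover exactly the code points U+0080 to U+00FF, which is the non-ASCII half of Latin1.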
Additional Information: Attached is a file which shows a C++ code fragment that the cdrdao project uses when writing CD text fields in TOC files. The file also has two functions and a minimal test program (sorry, only simple ANSI C) which I sketched out to get the strings into the proper representation.
Tags: No tags attached.

Activities

2012-01-21 21:42

 

a3_toc_patch.c (3,104 bytes)   
#include <stdio.h>

/** 
 * Code fragments for properly encoding UTF-8 strings in cdrdao TOCs
 * 
 * Andreas Ruge, Jan 2012
 */
 
/* This is taken from cdrdao 1.2.3, it shows how text fields for TOCs
   are escaped (cdrdao runs under the C locale):
  
    out << " \"";
    for (i = 0; i < dataLen_ - 1; i++) {
      if (data_[i] == '"') {
        out << "\\\"";
      }
      else if (isprint(data_[i])) {
        out << data_[i];
      }
      else {
        sprintf(buf, "\\%03o", (unsigned int)data_[i]);
        out << buf;
      }
    }

    out << "\"";
  }
*/




/** 
 * Print a string in the format used for CD text strings in cdrdao TOC files
 * 
 * This is:
 *      a) escape double quotes with a backslash
 *      b) print all characters from 0x20 - 0x7E (printable ascii)
 *      c) use octal three digit representation for all other values 
 *      d) enclose the whole string with double quotes
 * 
 * Andreas Ruge, 2012
 */
void toc_print_string(const char *s, FILE *fp)
{
    fprintf(fp, " \"");
    
    for ( ; *s != '\0'; s++) 
    {
        if (*s == '"') 
        {
            fprintf(fp, "\\\"");
        } 
        else if (0x20 <= *s && *s <= 0x7E)
        {
            fprintf(fp, "%c", *s);
        } 
        else 
        {
            fprintf(fp, "\\%03o", (unsigned char)*s);
        }
    }
    
    fprintf(fp, "\"");
}


/**
 * Translate UTF-8 string to ISO 8859-1 (latin1)
 * 
 *  the dest buffer will never have to be larger than the src string
 * 
 * Return
 *  0   on success
 *  1   when a unicode sequence was found which can't be represented in
 *      ISO 8859-1, or when the unicode sequence is invalid
 * 
 * Andreas Ruge, 2012
 */
int utf8_to_latin1(unsigned char *dest, int dest_len, unsigned char *src)
{
    int ret = 0;
    
    /* stop one byte early so the terminating '\0' always fits in dest */
    while (*src && dest_len > 1) 
    {
        if (!(*src & 0x80))
        {
            /* 7-bit (ASCII range) */
            *dest++ = *src++;
            dest_len--;
        }
        else if (((*src & 0xFC) == 0xC0) && ((*(src + 1) & 0xC0) == 0x80)) 
        {
            /* bit pattern 110000xx  10xxxxxx,
               a two byte UTF-8 sequence with no more than 8 data bits used,
               i.e. can be translated straight to ISO 8859-1 */
            *dest    = *src++ << 6;
            *dest++ |= *src++ & 0x3F;
            dest_len--;
        } 
        else
        {
            /* byte is part of an UTF-8 sequence which can't be translated */
            ret = 1;
            src++;
        }
    }
    *dest = '\0';
    return ret;
}


/* Test function, to be used on a UTF-8 terminal */
int main(int argc, char *argv[]) {
    
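    char buf[100];
    char *p;
    int ret;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <UTF-8 string>\n", argv[0]);
        return 1;
    }

    /* the converter works on unsigned bytes, so cast at the call site */
    ret = utf8_to_latin1((unsigned char *)buf, sizeof buf,
                         (unsigned char *)argv[1]);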
    if (ret == 0) {
        printf("All characters converted to ISO 8859-1\n");
    } else {
        printf("Warning: some characters could not be converted to ISO 8859-1\n");
    }
    /*for (p = buf; *p; p++)
    {
        printf("%x ", (unsigned char)*p);
    }
    printf("\n");*/
    toc_print_string(buf, stdout);
    printf("\n");
    return 0;
    
}

2012-01-22 20:40

 

a3_toc_patch2.c (3,336 bytes)   
#include <stdio.h>

/** 
 * Code fragments for properly encoding UTF-8 strings in cdrdao TOCs
 * 
 * Andreas Ruge, Jan 2012
 */
 
/* This is taken from cdrdao 1.2.3, it shows how text fields for TOCs
   are escaped (cdrdao runs under the C locale):
  
    out << " \"";
    for (i = 0; i < dataLen_ - 1; i++) {
      if (data_[i] == '"') {
        out << "\\\"";
      }
      else if (isprint(data_[i])) {
        out << data_[i];
      }
      else {
        sprintf(buf, "\\%03o", (unsigned int)data_[i]);
        out << buf;
      }
    }

    out << "\"";
  }
*/




/** 
 * Print a string in the format used for CD text strings in cdrdao TOC files
 * 
 * This is:
 *      a) escape double quotes with a backslash
 *      b) print all characters from 0x20 - 0x7E (printable ascii), except the backslash
 *      c) use octal three digit representation for all other values 
 *      d) enclose the whole string in double quotes
 * 
 * Andreas Ruge, 2012
 */
void toc_print_string(const char *s, FILE *fp)
{
    fprintf(fp, " \"");
    
    for ( ; *s != '\0'; s++) 
    {
        if (*s == '"') 
        {
            fprintf(fp, "\\\"");
        } 
        else if (0x20 <= *s && *s <= 0x7E && *s != '\\')
        {
            fprintf(fp, "%c", *s);
        } 
        else 
        {
            fprintf(fp, "\\%03o", (unsigned char)*s);
        }
    }
    
    fprintf(fp, "\"");
}


/**
 * Translate UTF-8 string to ISO 8859-1 (latin1)
 * 
 *  the dest buffer will never have to be larger than the src string
 * 
 * Return
 *  0   on success
 *  1   when a unicode sequence was found which can't be represented in
 *      ISO 8859-1, or when the unicode string is invalid
 * 
 * Andreas Ruge, 2012
 */
int utf8_to_latin1(unsigned char *dest, int dest_len, unsigned char *src)
{
    int ret = 0;
    
    /* stop one byte early so the terminating '\0' always fits in dest */
    while (*src && dest_len > 1) 
    {
        if (!(*src & 0x80))
        {
            /* 7-bit => ASCII range */
            *dest++ = *src++;
            dest_len--;
        } 
        else
        {   
            /* 8-bit => UTF-8 multi-byte sequence */
            
            if (((*src & 0xFC) == 0xC0) && ((*(src + 1) & 0xC0) == 0x80)) 
            {
                /* bit pattern 110000xx  10xxxxxx,
                   a two byte UTF-8 sequence with no more than 8 data bits used,
                   i.e. can be translated straight to ISO 8859-1 */
                *dest    = *src++ << 6;
                *dest++ |= *src++ & 0x3F;
                dest_len--;
            } 
            else
            {
                /* part of UTF-8 multi-byte sequence which can't be
                   translated to ISO 8859-1 */
                ret = 1;
                src++;
            }
        }
    }
    *dest = '\0';
    return ret;
}


/* Test function, to be used on a UTF-8 terminal. */
int main(int argc, char *argv[]) {
    
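    char buf[100];
    char *p;
    int ret;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <UTF-8 string>\n", argv[0]);
        return 1;
    }

    /* the converter works on unsigned bytes, so cast at the call site */
    ret = utf8_to_latin1((unsigned char *)buf, sizeof buf,
                         (unsigned char *)argv[1]);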
    if (ret == 0) {
        printf("All characters converted to ISO 8859-1\n");
    } else {
        printf("Warning: some characters could not be converted to ISO 8859-1\n");
    }
    /*for (p = buf; *p; p++)
    {
        printf("%x ", (unsigned char)*p);
    }
    printf("\n");*/
    printf("Encoded for cdrdao TOC:");
    toc_print_string(buf, stdout);
    printf("\n");
    return 0;
    
}

anrug

2012-01-22 20:47

reporter   ~0012613

I've uploaded a modified version of my (C) code. The cdrdao parser does not require, and does not even allow, the backslash itself to be escaped. So when you enter a track title with a trailing backslash, the backslash will be written to the TOC file literally, resulting in something like:

TITLE "MyTitle\"

which will break the cdrdao TOC parser. Worse, if you enter a title like 'Title\123' you'll get 'TitleS' on your CD-R and no warnings whatsoever.

To work around this, Ardour should write a backslash in octal representation, as done in the new file 'a3_toc_patch2.c'. Weird stuff. ;)
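The workaround can be sketched as a buffer-based variant of the escaping routine. This is a minimal sketch, not the code that went into Ardour: the name `toc_escape` and its signature are hypothetical, and it assumes the input is already Latin1 and that the parser reads exactly three octal digits after a backslash, as the format described in this report implies.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: escape a Latin1 string for a cdrdao TOC text field
   into out, without the surrounding double quotes.
   Returns 0 on success, -1 if out is too small. */
int toc_escape(const char *s, char *out, size_t out_len)
{
    size_t used = 0;

    if (out_len == 0)
        return -1;

    for ( ; *s != '\0'; s++)
    {
        unsigned char c = (unsigned char)*s;
        /* printable ASCII passes through, except '"' and '\' */
        int plain = (c >= 0x20 && c <= 0x7E && c != '"' && c != '\\');
        size_t need = plain ? 1 : (c == '"' ? 2 : 4);

        /* keep one byte free for the terminating '\0' */
        if (used + need + 1 > out_len)
            return -1;

        if (plain)
        {
            out[used] = (char)c;
        }
        else if (c == '"')
        {
            out[used] = '\\';
            out[used + 1] = '"';
        }
        else
        {
            /* backslash and everything non-printable go out in octal */
            sprintf(out + used, "\\%03o", (unsigned int)c);
        }
        used += need;
    }
    out[used] = '\0';
    return 0;
}
```

With this sketch the title `MyTitle\` comes out as `MyTitle\134`, and `Title\123` comes out as `Title\134123`, so the digits typed by the user are no longer interpreted as an octal escape.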

paul

2012-01-23 17:24

administrator   ~0012617

svn rev 11314 contains an implementation of this. your code wasn't UTF-8 safe, but it did convey the general idea. I've tested it on a few test cases and it appears to work. please let me know if you are aware of any issues with it.

anrug

2012-01-23 20:08

reporter   ~0012628

Hmm, what did you mean by "UTF-8 safe"? The function for escaping is meant to be used with a Latin1 encoded (8bit) string. I may be completely mistaken but in your change to svn 11314 I can't see where we get a Latin1 string in the first place.
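The two-step pipeline described here (UTF-8 to Latin1 first, escaping second) depends on a conversion step that produces the Latin1 string. Below is a minimal sketch of a stricter converter than the one in the attachments. The name `utf8_to_latin1_strict` and its return codes are hypothetical, and this is not the code from svn rev 11314. Unlike the attached version it rejects the overlong lead bytes 0xC0 and 0xC1, and it skips a whole unconvertible character rather than one byte at a time.

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to ISO 8859-1.
   Returns 0 on success, 1 if at least one character was dropped (outside
   Latin1 or malformed input), -1 if dest is too small. */
int utf8_to_latin1_strict(const unsigned char *src,
                          unsigned char *dest, size_t dest_len)
{
    int ret = 0;
    size_t used = 0;

    if (dest_len == 0)
        return -1;

    while (*src != '\0')
    {
        unsigned char c = *src;
        int have = 0;

        if (c < 0x80)
        {
            /* ASCII: copy unchanged */
            have = 1;
            src++;
        }
        else if ((c == 0xC2 || c == 0xC3) && (src[1] & 0xC0) == 0x80)
        {
            /* U+0080..U+00FF: fold the two bytes into one Latin1 byte */
            c = (unsigned char)(((c & 0x03) << 6) | (src[1] & 0x3F));
            have = 1;
            src += 2;
        }
        else
        {
            /* not representable or malformed: skip this byte and any
               continuation bytes that follow it */
            ret = 1;
            src++;
            while ((*src & 0xC0) == 0x80)
                src++;
        }

        if (have)
        {
            if (used + 1 >= dest_len)
            {
                dest[used] = '\0';
                return -1;
            }
            dest[used++] = c;
        }
    }
    dest[used] = '\0';
    return ret;
}
```

A return value of 1 is the point at which a caller could warn the user that a character cannot be stored as CD text; the result in `dest` can then be passed to the escaping routine.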

paul

2012-01-23 20:13

administrator   ~0012629

i created a marker called "Something ß for "Ü"

anrug

2012-02-13 20:02

reporter   ~0012793

Has been fixed in a3

Issue History

Date Modified Username Field Change
2012-01-21 21:42 anrug New Issue
2012-01-21 21:42 anrug File Added: a3_toc_patch.c
2012-01-22 10:20 cth103 cost => 0.00
2012-01-22 10:20 cth103 Target Version => 3.0-beta3
2012-01-22 20:40 anrug File Added: a3_toc_patch2.c
2012-01-22 20:47 anrug Note Added: 0012613
2012-01-23 17:24 paul Note Added: 0012617
2012-01-23 17:24 paul Status new => feedback
2012-01-23 20:08 anrug Note Added: 0012628
2012-01-23 20:13 paul Note Added: 0012629
2012-02-13 20:02 anrug Note Added: 0012793
2012-02-13 20:02 anrug Status feedback => resolved
2012-02-13 20:02 anrug Resolution open => fixed
2012-02-13 20:02 anrug Assigned To => anrug
2012-02-13 20:02 anrug Status resolved => closed