0004647: TOC export does not properly encode CD text fields - MantisBT

ID	Project	Category	View Status	Date Submitted	Last Update

0004647	ardour	bugs	public	2012-01-21 21:42	2012-02-13 20:02

Reporter	anrug	Assigned To	anrug
Priority	normal	Severity	minor	Reproducibility	always
Status	closed	Resolution	fixed
Target Version	3.0-beta3

Summary	0004647: TOC export does not properly encode CD text fields
Description	Ardour (both 3.x and 2.x) writes CD text fields in UTF-8 (this might depend on my system's settings) when exporting cdrdao TOC files. But cdrdao expects Latin1 (aka ISO 8859-1), with double quotes escaped and any character outside the printable ASCII range written in octal representation ("\123"). For Asian languages cdrdao (and CD text) uses yet another notation; if someone is interested in getting that working with Ardour, I would need some help from people familiar with Asian language encodings. Ardour should do the necessary encoding/escaping and maybe warn the user when a character they used can't be encoded as CD text.
Additional Information	Attached is a file which shows a C++ code fragment that the cdrdao project uses when writing CD text fields in TOC files. The file also has two functions and a minimal test program (sorry, only simple ANSI C) which I sketched out to get the strings into the proper representation.
Tags	No tags attached.

2012-01-21 21:42	a3_toc_patch.c (3,104 bytes)

#include <stdio.h>

/**
 * Code fragments for properly encoding UTF-8 strings in cdrdao TOCs
 *
 * Andreas Ruge, Jan 2012
 */

/*
This is taken from cdrdao 1.2.3, it shows how text fields for TOCs are escaped
(cdrdao runs under the C locale):

  out << " \"";
  for (i = 0; i < dataLen_ - 1; i++) {
    if (data_[i] == '"') {
      out << "\\\"";
    } else if (isprint(data_[i])) {
      out << data_[i];
    } else {
      sprintf(buf, "\\%03o", (unsigned int)data_[i]);
      out << buf;
    }
  }
  out << "\"";
}
*/

/**
 * Print a string in the format used for CD text strings in cdrdao TOC files
 *
 * This is:
 *   a) escape double quotes with a backslash
 *   b) print all characters from 0x20 - 0x7E (printable ascii)
 *   c) use octal three digit representation for all other values
 *   d) enclose the whole string with double quotes
 *
 * Andreas Ruge, 2012
 */
void toc_print_string(const char *s, FILE *fp)
{
  fprintf(fp, " \"");
  for ( ; *s != '\0'; s++) {
    if (*s == '"') {
      fprintf(fp, "\\\"");
    } else if (0x20 <= *s && *s <= 0x7E) {
      fprintf(fp, "%c", *s);
    } else {
      fprintf(fp, "\\%03o", (unsigned char)*s);
    }
  }
  fprintf(fp, "\"");
}

/**
 * Translate UTF-8 string to ISO 8859-1 (latin1)
 *
 * the dest buffer will never have to be larger than the src string
 *
 * Return
 *   0 on success
 *   1 when a unicode sequence was found which can't be represented in
 *     ISO 8859-1, or when the unicode sequence is invalid
 *
 * Andreas Ruge, 2012
 */
int utf8_to_latin1(unsigned char *dest, int dest_len, unsigned char *src)
{
  int ret = 0;
  src = (unsigned char *)src;
  while (*src && dest_len) {
    if (!(*src & 0x80)) {
      /* 7-bit (ASCII range) */
      *dest++ = *src++;
      dest_len--;
    } else if (((*src & 0xFC) == 0xC0) && ((*(src + 1) & 0xC0) == 0x80)) {
      /* bit pattern 110000xx 10xxxxxx, a two byte UTF-8 sequence with
         no more than 8 data bits used, i.e. can be translated straight
         to ISO 8859-1 */
      *dest = *src++ << 6;
      *dest++ |= *src++ & 0x3F;
      dest_len--;
    } else {
      /* byte is part of an UTF-8 sequence which can't be translated */
      ret = 1;
      src++;
    }
  }
  *dest = '\0';
  return ret;
}

/* Test function, to be used on a UTF-8 terminal */
int main(int argc, char *argv[])
{
  char buf[100];
  char *p;
  int ret = utf8_to_latin1(buf, sizeof buf, argv[1]);
  if (ret == 0) {
    printf("All characters converted to ISO 8859-1\n");
  } else {
    printf("Warning: some characters could not be converted to ISO 8859-1\n");
  }
  /*for (p = buf; *p; p++) { printf("%x ", (unsigned char)*p); } printf("\n");*/
  toc_print_string(buf, stdout);
  printf("\n");
  return 0;
}

2012-01-22 20:40	a3_toc_patch2.c (3,336 bytes)

#include <stdio.h>

/**
 * Code fragments for properly encoding UTF-8 strings in cdrdao TOCs
 *
 * Andreas Ruge, Jan 2012
 */

/*
This is taken from cdrdao 1.2.3, it shows how text fields for TOCs are escaped
(cdrdao runs under the C locale):

  out << " \"";
  for (i = 0; i < dataLen_ - 1; i++) {
    if (data_[i] == '"') {
      out << "\\\"";
    } else if (isprint(data_[i])) {
      out << data_[i];
    } else {
      sprintf(buf, "\\%03o", (unsigned int)data_[i]);
      out << buf;
    }
  }
  out << "\"";
}
*/

/**
 * Print a string in the format used for CD text strings in cdrdao TOC files
 *
 * This is:
 *   a) escape double quotes with a backslash
 *   b) print all characters from 0x20 - 0x7E (printable ascii), except the backslash
 *   c) use octal three digit representation for all other values
 *   d) enclose the whole string in double quotes
 *
 * Andreas Ruge, 2012
 */
void toc_print_string(const char *s, FILE *fp)
{
  fprintf(fp, " \"");
  for ( ; *s != '\0'; s++) {
    if (*s == '"') {
      fprintf(fp, "\\\"");
    } else if (0x20 <= *s && *s <= 0x7E && *s != '\\') {
      fprintf(fp, "%c", *s);
    } else {
      fprintf(fp, "\\%03o", (unsigned char)*s);
    }
  }
  fprintf(fp, "\"");
}

/**
 * Translate UTF-8 string to ISO 8859-1 (latin1)
 *
 * the dest buffer will never have to be larger than the src string
 *
 * Return
 *   0 on success
 *   1 when a unicode sequence was found which can't be represented in
 *     ISO 8859-1, or when the unicode string is invalid
 *
 * Andreas Ruge, 2012
 */
int utf8_to_latin1(unsigned char *dest, int dest_len, unsigned char *src)
{
  int ret = 0;
  while (*src && dest_len) {
    if (!(*src & 0x80)) {
      /* 7-bit => ASCII range */
      *dest++ = *src++;
      dest_len--;
    } else {
      /* 8-bit => UTF-8 multi-byte sequence */
      if (((*src & 0xFC) == 0xC0) && ((*(src + 1) & 0xC0) == 0x80)) {
        /* bit pattern 110000xx 10xxxxxx, a two byte UTF-8 sequence with
           no more than 8 data bits used, i.e. can be translated straight
           to ISO 8859-1 */
        *dest = *src++ << 6;
        *dest++ |= *src++ & 0x3F;
        dest_len--;
      } else {
        /* part of UTF-8 multi-byte sequence which can't be translated
           to ISO 8859-1 */
        ret = 1;
        src++;
      }
    }
  }
  *dest = '\0';
  return ret;
}

/* Test function, to be used on a UTF-8 terminal. */
int main(int argc, char *argv[])
{
  char buf[100];
  char *p;
  int ret = utf8_to_latin1(buf, sizeof buf, argv[1]);
  if (ret == 0) {
    printf("All characters converted to ISO 8859-1\n");
  } else {
    printf("Warning: some characters could not be converted to ISO 8859-1\n");
  }
  /*for (p = buf; *p; p++) { printf("%x ", (unsigned char)*p); } printf("\n");*/
  printf("Encoded for cdrdao TOC:");
  toc_print_string(buf, stdout);
  printf("\n");
  return 0;
}

anrug 2012-01-22 20:47 reporter ~0012613	I've uploaded a modified version of my (C) code. The cdrdao parser does not require, and does not even allow, the backslash itself to be escaped. So when you enter a track title with a trailing backslash, the backslash will be written to the TOC file literally, resulting in something like: TITLE "MyTitle\" which will break the cdrdao TOC parser. Worse, if you enter a title like 'Title\123' you'll get 'TitleS' on your CD-R and no warnings whatsoever. To work around this, Ardour should write a backslash in octal representation, as done in the new file 'a3_toc_patch2.c'. Weird stuff. ;)

paul 2012-01-23 17:24 administrator ~0012617	svn rev 11314 contains an implementation of this. your code wasn't UTF-8 safe, but it did convey the general idea. I've tested it on a few test cases and it appears to work. please let me know if you are aware of any issues with it.

anrug 2012-01-23 20:08 reporter ~0012628	Hmm, what did you mean by "UTF-8 safe"? The function for escaping is meant to be used with a Latin1 encoded (8bit) string. I may be completely mistaken but in your change to svn 11314 I can't see where we get a Latin1 string in the first place.

paul 2012-01-23 20:13 administrator ~0012629	i created a marker called "Something ß for "Ü"

anrug 2012-02-13 20:02 reporter ~0012793	Has been fixed in a3

Date Modified	Username	Field	Change
2012-01-21 21:42	anrug	New Issue
2012-01-21 21:42	anrug	File Added: a3_toc_patch.c
2012-01-22 10:20	cth103	cost	=> 0.00
2012-01-22 10:20	cth103	Target Version	=> 3.0-beta3
2012-01-22 20:40	anrug	File Added: a3_toc_patch2.c
2012-01-22 20:47	anrug	Note Added: 0012613
2012-01-23 17:24	paul	Note Added: 0012617
2012-01-23 17:24	paul	Status	new => feedback
2012-01-23 20:08	anrug	Note Added: 0012628
2012-01-23 20:13	paul	Note Added: 0012629
2012-02-13 20:02	anrug	Note Added: 0012793
2012-02-13 20:02	anrug	Status	feedback => resolved
2012-02-13 20:02	anrug	Resolution	open => fixed
2012-02-13 20:02	anrug	Assigned To	=> anrug
2012-02-13 20:02	anrug	Status	resolved => closed