Bug 53 - abcde fails with accented characters from CD-TEXT
Status: CONFIRMED
Alias: None
Product: abcde
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Linux
Importance: Normal normal
Assignee: Steve McIntyre
Reported: 2017-01-29 13:00 GMT by Matthias Andree
Modified: 2017-02-01 07:53 GMT
Attachments
"script" recording of abcde output. WARNING: contains control characters such as embedded CR without LF, and ESC (CSI) sequences, from ripper and encoder progress outputs. (160.56 KB, text/plain)
2017-01-29 13:00 GMT, Matthias Andree
abcde.hQUT... directory as .tar.gz tarball (2.42 KB, application/octet-stream)
2017-01-29 13:01 GMT, Matthias Andree
fix grep callouts from cddb-tool to use -a (4.12 KB, patch)
2017-01-29 17:59 GMT, Matthias Andree
partial fix for CD-TEXT tagging of MP3 (through eyeD3) and FLAC (metaflac) (3.50 KB, patch)
2017-01-29 18:00 GMT, Matthias Andree

Description Matthias Andree 2017-01-29 13:00:22 GMT
Created attachment 42
"script" recording of abcde output. WARNING: contains control characters such as embedded CR without LF, and ESC (CSI) sequences, from ripper and encoder progress outputs.

I am using abcde v2.8.1-2-gc91ca32, and am trying to rip a CD that:
- does *not* have musicbrainz data
- would have CDDB data, albeit with misspellings, at http://www.freedb.org/freedb/misc/c012a20f (accessible with ripit)
- does have CD-TEXT with accented French/Spanish characters

I am running in a German UTF-8 locale (details below).

abcde fails to recode the CD-TEXT to the proper character set, and consequently fails to properly encode/tag/move the files out to their final place.  

abcde does *not* fall back to CDDB, which maybe it should.

I am *not* saving the offered CDDB data file for edit, which - according to vim - is in latin1 format (aka ISO-8859-1, plausible).
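For illustration, the kind of recoding that is missing here could look roughly like the sketch below. It is not taken from abcde: the helper name to_utf8 is hypothetical, the input is assumed to be either valid UTF-8 or ISO-8859-1, and the sample title assumes the garbled character in track 2 is an "ó" (byte 0xF3).

```shell
#!/bin/sh
# Hypothetical helper (not part of abcde): print file $1 as UTF-8.
# Assumption: anything that is not valid UTF-8 is ISO-8859-1.
to_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        # Already valid UTF-8 (this includes plain ASCII): pass through.
        cat "$1"
    else
        # Not valid UTF-8: recode from Latin-1.
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}

# Usage: a sample line with a raw Latin-1 byte (octal 363 = 0xF3).
sample=$(mktemp)
printf 'TTITLE1=Marr\363n Y Azul\n' > "$sample"
to_utf8 "$sample"
rm -f "$sample"
```

Checking validity first avoids double-encoding data that is already UTF-8. A real fix would still need to know the actual source encoding, which this sketch only guesses.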

command line:
$ script -e -c "abcde -o 'mp3:-b 256,flac' -D -V -a move -d /dev/sr0 2" abcde-debug-cdtext-accented.txt

effective abcde command line:
$ abcde -o 'mp3:-b 256,flac' -D -V -a move -d /dev/sr0 2


The full log is attached.


This is what remains in the abcde... directory:
-rw-rw-r-- 1 mandree mandree      350 Jan 29 13:09 cddbchoices
-rw-rw-r-- 1 mandree mandree       67 Jan 29 13:09 cddbquery
-rw-rw-r-- 1 mandree mandree      801 Jan 29 13:09 cddbread.0
-rw-rw-r-- 1 mandree mandree      892 Jan 29 13:09 cddbread.1
-rw-rw-r-- 1 mandree mandree     2604 Jan 29 13:09 cd-text
-rw-rw-r-- 1 mandree mandree      135 Jan 29 13:09 discid
-rw-rw-r-- 1 mandree mandree      497 Jan 29 13:11 errors
-rw-rw-r-- 1 mandree mandree      351 Jan 29 13:11 status
-rw-rw-r-- 1 mandree mandree 24033309 Jan 29 13:11 track2.flac
-rw-rw-r-- 1 mandree mandree  8425220 Jan 29 13:11 track2.mp3

==> errors <==
tagtrack-mp3-2: returned code 1: nice -n 10 eyeD3 --set-encoding utf16-LE -A Übereinstimmungen in Binärdatei /var/tmp/abcde.hQUTxkBbsM1ZcMfAAwT.F0Zcz6A-/cddbread.1 -a Übereinstimmungen in Binärdatei /var/tmp/abcde.hQUTxkBbsM1ZcMfAAwT.F0Zcz6A-/cddbread.1 -t Marr�n Y Azul -G 255 -n 2 -N 15 /var/tmp/abcde.hQUTxkBbsM1ZcMfAAwT.F0Zcz6A-/track2.mp3
tagtrack-flac-2: returned code 1: nice -n 10 metaflac --no-utf8-convert --import-tags-from=- /var/tmp/abcde.hQUTxkBbsM1ZcMfAAwT.F0Zcz6A-/track2.flac


I am attaching the text files in a .tar.gz file later.


Observe that sed and grep (GNU in my case, on Ubuntu Linux 16.04.X LTS) fail to process these files, for instance:

sed: -e Ausdruck #1, Zeichen 36: Nicht beendeter `s'-Befehl

This happens because the title ends with a Latin1 "á" character (0xE1), which in UTF-8 would be the lead byte of a multi-byte sequence. sed then reads the ~ delimiter that follows, flags an invalid sequence because the next byte in UTF-8 would have to be in the range 0x80...0xBF, and aborts the parse.

translation aid from German:
* Übereinstimmungen in Binärdatei => matches in binary file (probably grep mistaking the text file for binary due to mismatched character set)
* sed: Ausdruck #1, Zeichen 36: Nicht beendeter `s'-Befehl => sed: expression #1, character 36: unterminated `s' command
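One possible way to sidestep the sed failure described above is to run sed in the C locale, where every byte counts as one character and the ~ delimiter is always recognised. The sketch below is only an illustration: substitute_title, the TITLE placeholder and the sample title are hypothetical and do not come from the abcde source.

```shell
#!/bin/sh
# Hypothetical helper (not from abcde): replace the placeholder TITLE on
# stdin with the raw bytes in $1. LC_ALL=C makes sed treat input bytewise.
# Caveat: $1 must not contain ~, & or backslashes, which sed would interpret.
substitute_title() {
    LC_ALL=C sed -e "s~TITLE~$1~"
}

# Hypothetical title ending in a Latin-1 "á" (octal 341 = 0xE1).
title=$(printf 'Ol\341')

# Show the result as hex bytes so the raw 0xE1 is visible.
printf 'TITLE\n' | substitute_title "$title" | od -An -tx1
```

This only keeps sed from choking; the output is still Latin-1 and would need recoding before it is used for tagging in a UTF-8 locale.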


I am trying to patch this up so it works. There are more bugs with parsing icedax CD-TEXT output (it can't identify the album artist, for instance).


locale details:

$ env | egrep 'LANG|LC' | sort
GDM_LANG=de_DE
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_ADDRESS=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_MONETARY=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_NUMERIC=de_DE.UTF-8
LC_PAPER=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_TIME=de_DE.UTF-8

$ locale
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=
Comment 1 Matthias Andree 2017-01-29 13:01:27 GMT
Created attachment 43
abcde.hQUT... directory as .tar.gz tarball
Comment 2 Matthias Andree 2017-01-29 17:59:55 GMT
Created attachment 44
fix grep callouts from cddb-tool to use -a

fix for cddb-tool to have it force grep to treat "binary" as text (such as ISO-8859-1 data in UTF-8 locales)
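A minimal sketch of the effect of -a, assuming GNU grep: without it, a line containing bytes that are invalid in the current locale can make grep report a binary match instead of printing the line. The helper cddb_field and the sample file are hypothetical and are not the code from the attached patch.

```shell
#!/bin/sh
# Hypothetical helper (not the actual cddb-tool code): print the value of
# field $1 from CDDB entry file $2. The -a option makes grep treat the
# file as text even if it contains bytes invalid in the current locale.
cddb_field() {
    grep -a "^$1=" "$2" | cut -d= -f2-
}

# Usage: a made-up entry with a raw Latin-1 byte (octal 341 = 0xE1).
entry=$(mktemp)
printf 'DTITLE=Foo / B\341r\nDYEAR=2000\n' > "$entry"
cddb_field DTITLE "$entry" | od -An -tx1   # hex dump keeps the raw byte visible
cddb_field DYEAR "$entry"
rm -f "$entry"
```

As with the other workarounds, -a only prevents the "binary file" message; the extracted value keeps its original encoding.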
Comment 3 Matthias Andree 2017-01-29 18:00:41 GMT
Created attachment 45
partial fix for CD-TEXT tagging of MP3 (through eyeD3) and FLAC (metaflac)

This patch is against git master, on top of the cddb-tool fix. Use git am or git apply.
Comment 4 Andrew Strong 2017-02-01 07:53:28 GMT
Hi Matthias,

I confess that my time with computers in general is being scaled back this year, along with my commitment to abcde, but I recognise the usual depth of thought behind your patches and I have committed both.

Work on the deeper issues you have pointed out will have to be the work of another, I am afraid, as I have some major non-computer related projects for 2017 and possibly 2018...

Thanks yet again for your contributions!

Andrew