Does (should) LC_COLLATE affect character ranges?

Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:
unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'
Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly Latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, plus Chinese locales) except Japanese (in any available encoding) and C/POSIX.
What do character ranges mean in regular expressions when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and against whom should a bug be reported?
(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)

On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"
and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"
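
Since the locale -k output does not show the ordering itself, one rough way to look at the collation order a range is measured against is to sort a few letters under each locale. This is only a sketch: it assumes a locale named en_US.UTF-8 is generated (if it is not, sort typically falls back to C ordering), and the variable names c_order and en_order are made up for illustration. The en_US result varies between systems, so it is not shown.

```shell
# Sort four letters under the C locale, where order follows byte values.
c_order=$(printf '%s\n' a B b A | LC_ALL=C sort | tr -d '\n')
echo "C locale order: $c_order"        # prints "C locale order: ABab"

# Same input under en_US.UTF-8 (assumed to be generated on this machine).
# If capitals are interleaved with lowercase letters here, a range such
# as a-z taken in collation order can include capitals as well.
en_order=$(printf '%s\n' a B b A | LC_ALL=en_US.UTF-8 sort 2>/dev/null | tr -d '\n')
echo "en_US.UTF-8 order: $en_order"
```

LC_ALL is used instead of LC_COLLATE so that a value already set in the environment cannot override the comparison.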

ANSWER:-

If you are using anything other than the C locale, you shouldn't use ranges like [a-z], since they are locale-dependent and don't always give the results you would expect. Besides the case issue you've already encountered, some locales treat characters with diacritics (e.g. á) the same as the base character (i.e. a).
Instead, use a named character class:
echo B | grep '[[:lower:]]'
This will always give the correct result for the locale. However, you need to choose the locale to reflect the meaning of both your input text and the test you are trying to apply.
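
As a baseline for comparison, here is a minimal sketch of how the range and the named class behave when the locale is pinned to C, where the two should agree for ASCII input. The variable names are illustrative only. Results under other locales depend on the system, as the question shows, so none are given here.

```shell
# LC_ALL=C pins every locale category, so ranges follow ASCII byte order.
# grep -c prints the number of matching lines; "|| true" hides its
# non-zero exit status when that count is 0.
range_upper=$(echo B | LC_ALL=C grep -c '[a-z]' || true)        # uppercase B against the range
class_upper=$(echo B | LC_ALL=C grep -c '[[:lower:]]' || true)  # uppercase B against the class
class_lower=$(echo b | LC_ALL=C grep -c '[[:lower:]]' || true)  # lowercase b against the class

echo "range/B=$range_upper class/B=$class_upper class/b=$class_lower"
# prints "range/B=0 class/B=0 class/b=1"
```

Replacing C with another locale name in the same three commands gives a quick way to see whether that locale's [a-z] and [[:lower:]] disagree on a particular machine.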
