[Q] unicode support

Didier Verna

2012-09-26 07:46:53 UTC

Archived-At: <http://permalink.gmane.org/gmane.lisp.cl-pro/710>

Hello,

Does anyone have a global view on the status of Unicode support in the
main CL implementations (SBCL, CMU-CL, ECL, CCL, CLISP, ABCL, ACL,
LispWorks)? By that, I mean mostly being able to encode source files in
UTF-8, with the following concerns:

- do I need to do something special at the implementation level?
- do I need to do something special at the ASDF system level?
- could this break the code of people using my libraries?

Thank you.

--
Resistance is futile. You will be jazzimilated.

Scientific site: http://www.lrde.epita.fr/~didier
Music (Jazz) site: http://www.didierverna.com

Marco Antoniotti

2012-09-26 07:48:24 UTC

Archived-At: <http://permalink.gmane.org/gmane.lisp.cl-pro/711>

Excellent question. Time for a CDR?

MA

_______________________________________________
pro mailing list
http://lists.common-lisp.net/cgi-bin/mailman/listinfo/pro

--
Marco Antoniotti, Associate Professor tel. +39 - 02 64 48 79 01
DISCo, Università Milano Bicocca U14 2043 http://bimib.disco.unimib.it
Viale Sarca 336
I-20126 Milan (MI) ITALY

Please note that I am not checking my Spam-box anymore.
Please do not forward this email without asking me first.

Faré

2012-09-26 15:54:12 UTC

Post by Didier Verna
Does anyone have a global view on the status of Unicode support in the
main CL implementations (SBCL, CMU-CL, ECL, CCL, CLISP, ABCL, ACL,
LispWorks)? By that, I mean mostly being able to encode source files in
UTF-8, with the following concerns:
- do I need to do something special at the implementation level?
- do I need to do something special at the ASDF system level?
- could this break the code of people using my libraries?

So far as I can tell, all these implementations support Unicode,
though some of them can be explicitly built without it.

ASDF has supported Unicode source files since release 2.21 (April 2012).
The recommended, backwards-compatible incantation
(in your defsystem, or in any specific component) is:
#+asdf-unicode :encoding #+asdf-unicode :utf-8
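For context, here is a minimal sketch of where that incantation could sit in a system definition. The system name, description, and file names are hypothetical; only the `#+asdf-unicode` line comes from the message above.

```lisp
;; my-library.asd (hypothetical system and file names, for illustration only)
(asdf:defsystem :my-library
  :description "Example system whose sources are encoded in UTF-8."
  ;; Both forms are read only when :asdf-unicode is on *features*.
  ;; An older ASDF skips them, so it never sees an unknown :encoding option.
  #+asdf-unicode :encoding #+asdf-unicode :utf-8
  :components ((:file "package")
               (:file "unicode-strings" :depends-on ("package"))))
```

Guarding both the keyword and its value keeps the option list balanced whether or not the feature is present.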

I have the intention of making utf-8 the default eventually,
but last we checked (in April this year), that would break
7 files out of all of quicklisp,
the authors of which have never replied regarding fixing them.

Also, if you (asdf:load-system :asdf-encodings) explicitly and early,
you can use :encoding :latin1, or :encoding :euc-jp,
or whichever encoding your implementation supports
(asdf-encodings at this time won't transcode things for you).
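A rough sketch of that setup, assuming asdf-encodings is installed where ASDF can find it and that the running implementation actually provides a Latin-1 external format. The system and file names are made up for the example.

```lisp
;; Load asdf-encodings before ASDF processes any system that names
;; an encoding other than UTF-8.
(asdf:load-system :asdf-encodings)

;; Hypothetical system whose sources are stored as Latin-1.
(asdf:defsystem :my-legacy-library
  :encoding :latin1
  :components ((:file "package")
               (:file "legacy-strings" :depends-on ("package"))))
```

Since asdf-encodings does not transcode, the files must already be stored in the encoding that is declared.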

An example system that explicitly uses this UTF-8 support
is lambda-reader, which I published earlier this year
(last edited in April also), based on an initial implementation
by Brian Mastenbrook.

—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
To send men to the firing squad, judicial proof is unnecessary... These
procedures are an archaic bourgeois detail. This is a revolution! And a
revolutionary must become a cold killing machine motivated by pure hate.
— Che Guevara

Robert Smith

2012-09-26 17:23:16 UTC

I think it might be worthwhile to look at Unicode beyond just seeing
whether files can be encoded as UTF-8.

The concept of "unicode support" is pretty loaded. What does it mean?
Does unicode support mean that one can operate on strings stored in a
particular fashion? Does it mean functions like LENGTH handle
overlaying characters correctly (e.g., any character plus a circumflex
overlaying character... does that have length 1 or 2?)? Do the
printers support stuff like right-to-left printing?

See http://stackoverflow.com/a/6163129 for details on why unicode
support isn't a simple concept.

Cheers,

Robert Smith

Nikodemus Siivola

2012-09-29 12:28:34 UTC

Archived-At: <http://permalink.gmane.org/gmane.lisp.cl-pro/719>

Post by Robert Smith
I think it might be worthwhile to look at Unicode beyond just seeing
whether files can be encoded as UTF-8.
The concept of "unicode support" is pretty loaded. What does it mean?
Does unicode support mean that one can operate on strings stored in a
particular fashion? Does it mean functions like LENGTH handle
overlaying characters correctly (e.g., any character plus a circumflex
overlaying character... does that have length 1 or 2?)? Do the
printers support stuff like right-to-left printing?

I think the CL standard is pretty clear on what LENGTH does -- Unicode
doesn't come into it, /unless/ you happen to be on an implementation
that supports custom sequence types and defined one that understands
combining characters.
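A small sketch of the LENGTH point, assuming an implementation whose CHAR-CODE-LIMIT is above #x302, so that the combining circumflex is representable as a character. The variable name is just for illustration.

```lisp
;; "e" followed by U+0302 COMBINING CIRCUMFLEX ACCENT:
;; one glyph on screen, but two character objects in the string.
(defparameter *e-circumflex*
  (coerce (list #\e (code-char #x0302)) 'string))

;; LENGTH counts the elements of the sequence, not rendered glyphs.
(length *e-circumflex*)   ; => 2
```

Anything that counts user-perceived characters would have to be a separate function layered on top of this.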

The only place where the standard really hooks into Unicode is external
formats. Most (all?) of the tricky unicode stuff should IMO be
separate functions, instead of introducing subtleties to standard
ones.

I think some crucial questions are:

* What is CHAR-CODE-LIMIT?

* Are there holes in the char-code range?

* Which external formats are supported?

* Can strings contain arbitrary codepoints, or only things that
represent fully-fledged characters? (Can UTF-8b be supported?)

* Can users define new external formats?

* Are multiple line-ending conventions supported?

* BOM?

* Are the character names there?

* Is the Unicode database, which the implementation needs to have
anyway, accessible via a documented API?

* Is everything that should be O(1) O(1), or are some things O(N) with Unicode?

* Are there multiple string representations? (E.g. one for the 0-255
range, one for the full char-code range.)
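A few of these questions can be probed from the REPL with standard functions only. The sketch below uses hypothetical helper names (SAFE-CODE-CHAR, UNICODE-PROBE), and the results differ between implementations, so no particular output is implied.

```lisp
(defun safe-code-char (code)
  "Return the character for CODE, or NIL if CODE is out of range or a hole."
  (and (< code char-code-limit)
       (code-char code)))

(defun unicode-probe ()
  "Return a plist answering some of the questions for the running Lisp."
  (let ((lambda-char (safe-code-char #x03BB))   ; a Greek letter
        (surrogate   (safe-code-char #xD800)))  ; a surrogate code point
    (list :char-code-limit char-code-limit
          ;; Character names are implementation-defined.
          :lambda-name (and lambda-char (char-name lambda-char))
          ;; NIL here suggests a hole in the char-code range.
          :surrogate-is-character (not (null surrogate))
          ;; T suggests a single string representation.
          :character-is-base-char (values (subtypep 'character 'base-char))
          :literal-string-type (type-of "abc"))))
```

Calling (unicode-probe) on each implementation of interest gives a quick first comparison. External formats, BOM handling and line-ending conventions are not covered by portable functions, so they still need the per-implementation documentation.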

Cheers,

-- Nikodemus