
Re: [Scheme-reports] Sequence to sequence conversion


On 07/02/2012 05:33 AM, Marc Feeley wrote:

> 2) The procedures specify in their names the character encoding to
> use.  But there are oodles of character encodings, so for easy
> extensibility to other encodings, it would be better to use a
> parameter as in (decode-string bytevector 'UTF-8) and
> (encode-string string 'UTF-8) instead of oodles of different
> procedures.

If we intend bytevectors to function as blobs, then
we need to be very clear that an 'encoding' is a thing
you can use to get information in some form you
understand from a blob, or to put information in some
form you understand into a blob. But that's not limited
to strings.  A blob, by its nature, can be anything.

For example, in an image processing program, someone
might have defined an 'image' record type containing
a size, optional color table, 2-dimensional array of
pixels, strings for embedded comments, etc.  He
would want to define encodings to read and write
'gif' and 'png' and 'bmp' and 'jpg' and so on.
Someone else might be dealing with, I dunno, binary
astronomy data, and want an encoding to read and
write the binary records of his ASTRA database, complete
with their 256-bit real numbers, ASCII names, etc.
Someone else handling communications software might
still need to hook something up to an EBCDIC
mainframe, and want to define an encoding to read
and write character data in that.
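
To make the "not limited to strings" point concrete, here is a rough
sketch of an encoding whose decoded form is an exact integer rather
than a string.  The names blob->uint and uint->blob are invented for
illustration and are not proposed for the report; the only assumption
is the bytevector procedures from (scheme base).

```scheme
(import (scheme base))

;; Decode: read the whole bytevector as an unsigned big-endian integer.
(define (blob->uint blob)
  (let loop ((i 0) (n 0))
    (if (= i (bytevector-length blob))
        n
        (loop (+ i 1) (+ (* n 256) (bytevector-u8-ref blob i))))))

;; Encode: write a non-negative exact integer as big-endian bytes,
;; using as few bytes as possible (at least one).
(define (uint->blob n)
  (let loop ((n n) (bytes '()))
    (if (< n 256)
        (apply bytevector (cons n bytes))
        (loop (quotient n 256) (cons (remainder n 256) bytes)))))

(blob->uint (uint->blob 65536))   ; => 65536
```

The same shape (one procedure out of the blob, one into it) would
apply to an image record or a database record; only the decoded type
changes.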

I agree that the standard ought to specify an encoding
(possibly bound to the symbol 'UTF-8) that allows reading
or writing character data (strings and program code).
But if so, I would say that an encoding ought to be a
first-class object bound to an identifier, which can
be passed as an evaluated argument into open-port,
decode-blob, etc., rather than a second-class object
invoked by passing a quoted token to those procedures.
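
A minimal sketch of what a first-class encoding could look like,
assuming R7RS-style records and the utf8->string / string->utf8
procedures.  The record type, the binding utf-8, and the procedures
decode-blob and encode-blob are hypothetical names used only for
illustration; nothing here is specified by WG1.

```scheme
(import (scheme base))

;; Hypothetical: an encoding is an ordinary record holding two procedures.
(define-record-type encoding
  (make-encoding name decoder encoder)
  encoding?
  (name encoding-name)
  (decoder encoding-decoder)    ; bytevector -> datum
  (encoder encoding-encoder))   ; datum -> bytevector

;; Bound to an identifier, so it is evaluated like any other argument
;; instead of being looked up from a quoted symbol.
(define utf-8 (make-encoding 'UTF-8 utf8->string string->utf8))

;; Hypothetical entry points that accept any encoding object.
(define (decode-blob blob enc) ((encoding-decoder enc) blob))
(define (encode-blob datum enc) ((encoding-encoder enc) datum))

(decode-blob (encode-blob "lambda" utf-8) utf-8)   ; => "lambda"
```

With this shape, a later extension that lets users call make-encoding
themselves would not change any call site that already passes utf-8.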

We ought to think about what an encoding is, and
whether/how the user can define and use one of his/her
own.  I would say it's a function having some specific
signature, but that's kind of a default answer to
anything in a functional language.  If it were an object
language, it would be an object exposing several specific
methods.  Tomayto, tomawto.

I don't think we really need to specify means to make
user-defined encodings in WG1.  But even if we don't, I
think that encodings ought to be first-class in principle,
so that the code we're enabling people to write doesn't
break at some later time, or under some compatible
extension, when users can define encodings.

At a minimum, we need encodings to map blobs to and
from Scheme data (including strings at the moment,
but conceptually including heterogeneous vectors,
user-defined record types, bignums, etc.).  We may also
need some way to determine whether a particular byte
in a blob represents the end of some decodable unit
according to that encoding.
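
For UTF-8 the end-of-unit question can be answered by looking at the
following byte.  The sketch below is only an illustration: the name
utf8-unit-end? is made up, and it assumes the blob holds complete,
well-formed UTF-8 (a blob truncated in mid-character would fool it).

```scheme
(import (scheme base))

;; A UTF-8 continuation byte has the bit pattern 10xxxxxx (#x80..#xBF).
(define (continuation-byte? b) (<= #x80 b #xBF))

;; Hypothetical helper: does the byte at index i end a character?
;; True when it is the last byte of the blob, or when the next byte
;; starts a new character (i.e. is not a continuation byte).
(define (utf8-unit-end? blob i)
  (let ((next (+ i 1)))
    (or (= next (bytevector-length blob))
        (not (continuation-byte? (bytevector-u8-ref blob next))))))

;; #x61 is "a"; #xC3 #xA9 is the two-byte encoding of U+00E9.
(define sample (bytevector #x61 #xC3 #xA9))

(utf8-unit-end? sample 0)   ; => #t
(utf8-unit-end? sample 1)   ; => #f
(utf8-unit-end? sample 2)   ; => #t
```

An encoding for fixed-size binary records would answer the same
question with simple index arithmetic, which is one reason to make the
test part of the encoding rather than of the port.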



Scheme-reports mailing list