perm filename CHARAC.PRO[NET,GUE] blob sn#003914 filedate 1972-07-13 generic text, type T, neo UTF8
00100	                      ARBITRARY CHARACTER SETS
00200	
00300	                          by John McCarthy
00400	
00500	
00600		It  would  be  nice  to  be  able to have documents stored in
00700	computers that could include arbitrary characters and to be  able  to
00800	display  them  on  any  CRT screen, edit them using any keyboard, and
00900	print them on any printer.  The  object  of  this  memorandum  is  to
01000	suggest how to get there from here with special reference to the ARPA
01100	network.
01200	
01300		Where are we now?
01400	
01500		1. At present, there is  96  character  ASCII,  and  everyone
01600	agrees that it should be included in any larger set.
01700	
01800		2.  Many  installations  are  dependent  on 64 character sets
01900	which do not even include the lower case Latin alphabet.
02000	
02100		3. At the Stanford  Artificial  Intelligence  Laboratory,  we
02200	have  a  114 character set that includes 96 character ASCII and which
02300	is implemented in our keyboards, displays, and line printer.
02400	
02500		4. Printers are becoming available that get  their  character
02600	designs  out  of  memory,  for example, the Xerox XGP printer, one of
02700	which we are getting.
02800	
02900		5. The IMLAC type display has the character designs  in  main
03000	memory  so  that  changing  the  displayed  set  is  just a matter of
03100	reloading the memory.
03200	
03300		6. Many display systems share the character  generator  among
03400	many  display  units.  In some of these, e.g. the Datadisc, arbitrary
03500	sets are probably feasible (using kludgery to  be  described  later),
03600	but  in  other  systems,  e.g.  our  III's,  arbitrary  sets  are not
03700	feasible.
03800	
03900		One possible approach to communication in expanded  character
04000	sets  is  to  produce an expanded standard set of characters, perhaps
04100	using 8 or 9 bits and expect new equipment  to  implement  this  set.
04200	This  approach  has the disadvantage that it will be very hard to get
04300	agreement on what the  next  step  should  be,  and  even  if  formal
04400	agreement is realized, many groups will find it in their interests to
04500	ignore the standard.
04600	
04700		Therefore, I would like to suggest that the next step be to
04800	move to arbitrary character sets.  I suggest implementing this in the
04900	following way:
05000	
05100		1. A registry of characters should be established.  Anyone can
05200	register a new character.  Each character has a unique number; 17
05300	bits should be enough even to include Chinese.  Besides this, each
05400	character has a name in ASCII, usually mnemonic.  Finally, the
05500	character has a design, which is a picture on a 50 by 50 dot matrix.
05600	
05700		2. Besides the registry of characters, there is a registry of
05800	character  sets,  which  different  groups  are  using  for different
05900	classes of documents.  A registered  character  set  has  a  registry
06000	number  and  a  table giving the correspondence between the character
06100	codes as bit sequences and the registered character numbers.
06200	
06300		3. Associated with a document is a statement of the character
06400	code used therein.  This may be one of the registered codes or it may
06500	contain in addition modifications described  by  an  auxiliary  table
06600	giving  the code correspondence with registered character numbers.  A
06700	character code may have an escape character that says that  the  next
06800	character  is  described by its registry number. The statement of the
06900	character code may be a header on the document or  the  receiver  may
07000	have  to  learn  it  by  some  other  means, e.g. because its library
07100	catalog entry contains this information.
07200	
07300		4. Devices such as printers and displays draw  characters  in
07400	different  ways and standardization doesn't seem feasible at present.
07500	Therefore, it is necessary  to  provide  a  way  of  going  from  the
07600	standard  description  of  a character using a 50 by 50 dot matrix to
07700	whatever method the device uses.  This is up to the  programmers  who
07800	are  supporting the device.  Some may choose to manually create files
07900	describing how registered characters are implemented. They  may  find
08000	it  too  much  work  to  provide for all the characters and to update
08100	their files when new characters are registered. Others  will  provide
08200	programs  for  going from the registered descriptions to descriptions
08300	compatible with their implementations.  Perhaps most will hand tailor
08400	the characters most used and provide a program for the others.
08500	
08600		5.  The  easiest device to handle is the line printer because
08700	it is slow.  At the beginning of the print  job,  the  SPOOL  program
08800	will look up the character set and load the printer's memory with the
08900	character designs used in the particular document.  Sometimes, it may
09000	have  to  go  through the network to one of the computers that stores
09100	the registry in order to find out what to do.
09200	
09300		6. Display systems that have  a  character  memory  for  each
09400	display  unit  can  be  handled  in  about  the same way.  Users will
09500	occasionally  experience  delays  when  the  display   programs   are
09600	surprised by unfamiliar characters.
09700	
09800		7. Display systems that share character memories require more
09900	complicated treatment.  The object is to keep the memory large enough
10000	to hold all the characters that the current set of users is using and
10100	to handle the required table lookups  from  the  different  character
10200	codes  in  a nice way.  There will be limitations on the diversity of
10300	character sets that can be in use simultaneously.  Systems  like  the
10400	Datadisc that only look up the character when it is first written can
10500	be extended to work with large sets.  Systems that have  to  look  up
10600	each  character  code  30  times  per second in order to maintain the
10700	display won't work so well.
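
	Points 1, 2, 3 and 5 above can be sketched in Python as follows.
This is only an illustration under stated assumptions, not a proposed
implementation: every name in it (Registry, decode, designs_needed) is
hypothetical, the escape code value is invented, and a document is
simplified to a list of integer codes in which the value following the
escape code is a registry number.

```python
from dataclasses import dataclass

MAX_NUMBER = (1 << 17) - 1  # the memo suggests 17 bits should be enough
SIZE = 50                   # designs are 50 by 50 dot matrices
ESCAPE = 0x1B               # assumed escape code; the memo does not fix one


@dataclass(frozen=True)
class RegisteredCharacter:
    number: int    # unique registry number
    name: str      # ASCII name, usually mnemonic
    design: tuple  # SIZE rows, each a tuple of SIZE dots (0 or 1)


class Registry:
    """Holds registered characters and registered character sets."""

    def __init__(self):
        self.characters = {}      # registry number -> RegisteredCharacter
        self.character_sets = {}  # set number -> {code: registry number}

    def register_character(self, number, name, design):
        if not 0 <= number <= MAX_NUMBER:
            raise ValueError("registry number does not fit in 17 bits")
        if number in self.characters:
            raise ValueError("registry number already in use")
        if not name.isascii():
            raise ValueError("name must be ASCII")
        if len(design) != SIZE or any(len(row) != SIZE for row in design):
            raise ValueError("design must be a 50 by 50 dot matrix")
        rows = tuple(tuple(row) for row in design)
        self.characters[number] = RegisteredCharacter(number, name, rows)

    def register_set(self, set_number, table):
        # A character set is a table from codes to registered numbers.
        unknown = [n for n in table.values() if n not in self.characters]
        if unknown:
            raise ValueError(f"unregistered characters: {unknown}")
        self.character_sets[set_number] = dict(table)


def decode(codes, table, overrides=None):
    """Translate document codes into registry numbers.

    `overrides` plays the role of the auxiliary table of point 3.
    """
    mapping = {**table, **(overrides or {})}
    numbers = []
    stream = iter(codes)
    for code in stream:
        if code == ESCAPE:
            number = next(stream, None)  # next value is a registry number
            if number is None:
                raise ValueError("escape at end of document")
            numbers.append(number)
        else:
            numbers.append(mapping[code])
    return numbers


def designs_needed(numbers, registry):
    """Collect the designs a spooler would load for one print job."""
    return {n: registry.characters[n].design for n in set(numbers)}
```

	A spooler built along these lines would call decode on the
document, then designs_needed on the result, and convert each design
to whatever form its printer requires.  A missing registry entry
surfaces here as a KeyError, which is the point at which a real
program might consult a registry elsewhere on the network.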
10800	
10900		I have no special ideas about how to make keyboards adaptable
11000	to arbitrary sets.  Each user may have to fend for himself.
11100	
11200		In  this  memorandum  so far, I have ignored typography, i.e.
11300	the fact that in printed documents the same letter may be printed  in
11400	many fonts.  Perhaps each character in each font will require a
11500	separate registered  description,  but  with  a  constant  difference
11600	between  the  numbers  of  the  same  character  in  different fonts.
11700	Installations will again have to decide what font  distinctions  they
11800	will implement.
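
	The "constant difference" idea amounts to simple arithmetic on
registry numbers.  A minimal sketch in Python, assuming a hypothetical
stride of 4096 between fonts (the memorandum does not fix a value, and
the function names are invented for illustration):

```python
FONT_STRIDE = 1 << 12  # assumed constant difference between fonts


def in_font(base_number, font_index):
    """Registry number of a base character in the given font."""
    if not 0 <= base_number < FONT_STRIDE:
        raise ValueError("base number must be smaller than the stride")
    if font_index < 0:
        raise ValueError("font index must not be negative")
    return base_number + font_index * FONT_STRIDE


def split_font(number):
    """Recover (font_index, base_number) from a registry number."""
    font_index, base_number = divmod(number, FONT_STRIDE)
    return font_index, base_number
```

	An installation that does not implement a given font
distinction could use split_font to fall back on the base character.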
11900	
12000		Another issue that might be considered is whether means can
12100	be provided to adapt texts automatically to the line and page lengths
12200	of the different devices.
12300	
12400		It seems to me most likely that the typographical problems
12500	cannot be solved at this time, and that it would be best to adopt
12600	conventions for registering character designs now and leave
12700	typography for later.
12800	
12900		In my opinion, there is no real obstacle to establishing  the
13000	registry in the ARPA network now, getting the standards organizations
13100	to work, and being able to exchange documents in  extended  character
13200	sets  as  soon  as the various installations can acquire the printers
13300	and display devices.
13400	
13500		It  is  the  present  policy  of  the   Stanford   Artificial
13600	Intelligence Laboratory to acquire no more devices that are wedded to
13700	fixed character sets.