Appendix J: Character Sets

J.1 Introduction

UNIMARC records may be encoded using either 7-bit or 8-bit character code values. The specifications for identifying and using various character sets are described in the following sections of this appendix; they are in conformance with those contained in ISO 2022. That standard should also be consulted.

UNIMARC records may also be encoded using 16-bit character code values. See J.6 ISO 10646 character set.

J.2 Framework

A matrix for all character codes possible with 7‑bits is constructed as illustrated. Bits 7‑5 are represented by the columns, and bits 4‑1 by the rows. The ISO method of numbering is used, e.g. 7/15 not 7F for DEL.

Columns   Rows    Contents
0-1       0-15    32 control functions
2-7       0-15    94 graphic characters, with SP at 2/0 and DEL at 7/15

7-bit Code Matrix

A 7‑bit code set accommodates 32 control functions, 94 graphic characters, SPACE, and DELETE. The individual characters are commonly referred to by their column and row position in the matrix using the notation 'c/r', thus the SPACE character is 2/0. Code values are assigned according to the following rules. The first two columns of a code matrix are reserved for system control functions; columns 2‑7 are for graphic characters. The two corner codes of the graphic columns are reserved for SPACE and DELETE characters.
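
The 'c/r' notation corresponds to a code value of column × 16 + row. The following Python sketch illustrates that arithmetic and the layout rules above; the function names `position_to_code` and `classify_7bit` are hypothetical and are not part of UNIMARC or ISO 2022.

```python
def position_to_code(position: str) -> int:
    """Convert ISO 'column/row' notation, e.g. '7/15', to an integer code value."""
    column_text, row_text = position.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be in 0..15: {position!r}")
    return column * 16 + row


def classify_7bit(code: int) -> str:
    """Name the part of the 7-bit matrix that a code value falls in."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit code value")
    if code < 0x20:
        return "control function"   # columns 0-1
    if code == 0x20:
        return "SPACE"              # corner position 2/0
    if code == 0x7F:
        return "DELETE"             # corner position 7/15
    return "graphic character"      # the remaining 94 positions in columns 2-7
```

For instance, `position_to_code("7/15")` gives the value written 7F in hexadecimal, which `classify_7bit` reports as DELETE.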

Data may also be encoded using 8 bits per character, in which case the number of possible codes doubles, hence the code matrix doubles. Bits 8-5 are represented by the columns and bits 4-1 by the rows. The 8-bit matrix has four parts which are specified for control functions and graphic characters as illustrated.

Columns   Rows    Contents
00-01     0-15    32 control functions
02-07     0-15    94 graphic characters, with SP at 02/00 and DEL at 07/15
08-09     0-15    32 control functions
10-15     0-15    94 graphic characters

8-bit Code Matrix

The additional bit is the left-most bit; it is 0 for a left-hand part and 1 for a right-hand part. Graphic sets may be represented by either one 7-bit or 8-bit combination per character or, where there are a large number of characters in the set, by multiple 7-bit or 8-bit combinations per character.
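
As an illustration of the four parts of the 8-bit matrix, the sketch below tests the left-most bit to choose the half and then applies the 7-bit column rule to the remaining bits. The function name `classify_8bit` is hypothetical.

```python
def classify_8bit(code: int) -> tuple[str, str]:
    """Return (half, kind) for an 8-bit code value.

    half is 'left' (columns 00-07) or 'right' (columns 08-15);
    kind is 'control' (first two columns of the half) or 'graphic'
    (the six graphic columns of the half).
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("not an 8-bit code value")
    half = "right" if code & 0x80 else "left"    # bit 8 selects the half
    kind = "control" if (code & 0x7F) < 0x20 else "graphic"
    return half, kind
```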

Use of code sets requires first the designation of the sets, then the invocation of a designated set as the working set. For both 7-bit and 8-bit codes, two sets of control functions and four graphic character sets may be designated at any given time. These designated sets are called the C0 and C1 sets and the G0, G1, G2 and G3 sets. In 7 bits, two Cn sets and one Gn set may have invoked (working set) status at a given time. In 8 bits, two Cn sets and two Gn sets may have invoked (working set) status at a given time. The following sections of this appendix specify the designation and invocation of code sets in UNIMARC.

J.3 Control Function Sets

The C0 and C1 control function sets are fixed for UNIMARC. Thus they do not need to be designated and invoked in the record. The C0 set is the set of 32 control functions defined in ISO 646. This set contains the basic transmission controls and the subfield delimiter, field terminator, and record terminator.

The C1 set is the set of control functions defined in ISO 6630, Bibliographic Control Characters. Only the NSB 'Non-sorting character(s) beginning', NSE 'Non‑sorting character(s) ending', PLD 'Partial Line Down' and PLU 'Partial Line Up' functions from that set are currently allowed in UNIMARC.

In the 7-bit and 8-bit environments, the C0 set occupies columns 0 and 1 at all times. In a 7-bit record, the characters from the C1 set are represented by the two-character sequence 'ESC F', where ESC is the 1/11 control function in the C0 set and F is a bit combination from columns 4 and 5. The F bit combinations associated with each of the functions defined in ISO 6630 were assigned by ISO at the time of registration of the set and are identified for ISO 6630 in section J.7 of this appendix. Note especially that in the 7-bit environment the 'ESC F' sequence substitutes for the code table bit combinations of the ISO 6630 functions.

In an 8-bit record, the C1 set resides in columns 08 and 09, and the functions are represented by their code table bit combinations.
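
The two representations of a C1 function can be sketched as follows. The sketch assumes the usual ISO 2022 correspondence in which the F character has the same row as the 8-bit position but sits four columns lower (08 becomes 4, 09 becomes 5); the authoritative F values are those registered for ISO 6630. The name `encode_c1` is hypothetical, and the 8-bit positions are those listed for ISO 6630 later in this appendix.

```python
ESC = 0x1B  # position 1/11 in the C0 set

# 8-bit code table positions (column * 16 + row) of the functions allowed in UNIMARC.
C1_FUNCTIONS = {
    "NSB": 0x88,  # 08/08
    "NSE": 0x89,  # 08/09
    "PLD": 0x8B,  # 08/11
    "PLU": 0x8C,  # 08/12
}


def encode_c1(acronym: str, eight_bit: bool) -> bytes:
    """Return the byte(s) representing a C1 function in an 8-bit or 7-bit record."""
    code = C1_FUNCTIONS[acronym]
    if eight_bit:
        return bytes([code])             # the code table bit combination itself
    # Assumed ISO 2022 mapping: move four columns down (subtract 4 * 16).
    return bytes([ESC, code - 0x40])     # 'ESC F' with F in columns 4-5
```

Under that assumption, NSB is the single byte 08/08 in an 8-bit record and the pair ESC 4/8 in a 7-bit record.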

J.4 Graphic Character Sets

The G0 graphic set for UNIMARC is always ISO 646. All of the characters in the RECORD LABEL, the DIRECTORY, and the coded fields/subfields are from ISO 646, as are the field indicators and subfield codes. Thus a record always begins with ISO 646 as the working set. Up to three additional graphic sets may be designated as G1, G2 and G3 in field 100, subfield $a, character positions 28-29, Character Sets, and positions 30-33, Additional Character Sets. If no more than four sets are used in a record, the field 100 information is all that is required to designate the graphic sets. They can then be invoked as needed. Note that since the RECORD LABEL, DIRECTORY, and coded data fields are all coded using ISO 646, the G1, G2, and G3 designations in field 100 can be accessed before any additional graphic sets are encountered in the record.
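
A minimal sketch of reading the designations from field 100 follows. It assumes zero-based character positions, two characters per set, and that G0 occupies positions 26-27 (inferred from the notation '100 $a/26-33' used in an example in J.4.2). Unused positions are assumed to be blanks, shown as '#' in this Manual. The function name `graphic_set_codes` is hypothetical, and the sketch returns the raw two-character codes without interpreting them.

```python
def graphic_set_codes(subfield_100a: str) -> dict:
    """Extract the G0-G3 set codes from 100 $a, positions 26-33 (assumed zero-based)."""
    codes = subfield_100a[26:34]
    if len(codes) != 8:
        raise ValueError("100 $a is too short to contain positions 26-33")
    result = {}
    for index, name in enumerate(("G0", "G1", "G2", "G3")):
        code = codes[2 * index:2 * index + 2]
        # Blank positions (spaces in data, '#' in the Manual) mean no set is designated.
        result[name] = None if code in ("##", "  ") else code
    return result
```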

J.4.1 7-Bit Environment

In a 7-bit character record the four designated sets are invoked using the following ISO 2022 locking shifts:

Acronym   Full Name             Bit Combination(s)   Set Invoked
SI        Shift in              0/15                 G0
SO        Shift out             0/14                 G1
LS2       Locking shift two     ESC 6/14             G2
LS3       Locking shift three   ESC 6/15             G3

These shifts are locking, so the set invoked remains the working set until another set is specified by a shift function. Since the record begins with the G0 (ISO 646) set as the working set, the SI shift to the G0 set will only be used when there has been an invocation of one of the other Gn sets as the working set. The G0 (ISO 646) set must be the working set at the end of each subfield and field since the succeeding subfield codes or directory processing require ISO 646 as the working set. This shift back to the G0 (ISO 646) set should take place before the subfield delimiter or end of field mark.

In 7-bits, a non-locking invocation of single characters from the designated G2 or G3 set is also possible. The following non-locking shifts are defined by ISO 2022:

Acronym   Full Name            Bit Combinations   Set from which Single Character Invoked
SS2       Single shift two     ESC 4/14           G2
SS3       Single shift three   ESC 4/15           G3

There is no need to reinvoke the working set after the single shifts as it is automatically reinstated after one character from the G2 or G3 set.
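
The locking and single shifts described above behave like a small state machine. The following sketch tags each byte of a 7-bit subfield with the set it would be read from; the name `tag_7bit` is hypothetical, and other control functions and escape sequences are deliberately not handled.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F           # 1/11, 0/14, 0/15

LOCKING_ESC = {0x6E: "G2", 0x6F: "G3"}   # LS2 = ESC 6/14, LS3 = ESC 6/15
SINGLE_ESC = {0x4E: "G2", 0x4F: "G3"}    # SS2 = ESC 4/14, SS3 = ESC 4/15


def tag_7bit(data: bytes) -> list[tuple[str, int]]:
    """Return (set name, code value) for each non-shift byte in a 7-bit subfield."""
    working = "G0"   # a record starts with G0 (ISO 646) as the working set
    single = None    # set selected by a pending single shift, if any
    tagged = []
    i = 0
    while i < len(data):
        byte = data[i]
        following = data[i + 1] if i + 1 < len(data) else None
        if byte == SI:
            working = "G0"
        elif byte == SO:
            working = "G1"
        elif byte == ESC and following in LOCKING_ESC:
            working = LOCKING_ESC[following]
            i += 1                       # skip the second byte of the shift
        elif byte == ESC and following in SINGLE_ESC:
            single = SINGLE_ESC[following]
            i += 1
        else:
            tagged.append((single or working, byte))
            single = None                # a single shift covers exactly one character
        i += 1
    return tagged
```

After a single shift the working set is reinstated without any further shift, which is why `single` is cleared as soon as one character has been tagged.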

Examples (for clarity, bit combinations are in bold)

EX 1

500 11$aEdda S0/14æ0/15mundar.$mEnglish.$1Selections.

Shifts, in order of appearance: SO (0/14), SI (0/15).

In this record, the ISO 5426 Extended Latin set has been designated the G1 set and the single character 'æ' is accessed via an invocation of that set.

EX 2

500 11$aEdda S1/11 4/14æmundar.$mEnglish.$1Selections.

Shift used: SS2 (1/11 4/14).

If in EX 1 ISO 5426 had been designated a G2 set, the single shift function could be used to invoke the 'æ'.

EX 3

210 ##$a1/11 6/14Москва0/15$c"1/11 6/14Правда0/15"$d1968

Shifts, in order of appearance: LS2 (1/11 6/14), SI (0/15), LS2 (1/11 6/14), SI (0/15).

In this record, ISO 5426 has been designated the G1 set and the basic Cyrillic set has been designated the G2 set. This field contains a Cyrillic name. Shifts into the G2 set must be made at the beginning of each subfield with shifts back into the G0 set at the end of each.

J.4.2 8-bit Environment

In an 8-bit code record the four designated sets are invoked using the following ISO 2022 locking shifts:

Acronym   Full Name                   Bit Combinations   Set Invoked / Into Columns
LS0       Locking shift zero          00/15              G0 / 02-07
LS1       Locking shift one           00/14              G1 / 02-07
LS1R      Locking shift one right     ESC 7/14           G1 / 10-15
LS2       Locking shift two           ESC 6/14           G2 / 02-07
LS2R      Locking shift two right     ESC 7/13           G2 / 10-15
LS3       Locking shift three         ESC 6/15           G3 / 02-07
LS3R      Locking shift three right   ESC 7/12           G3 / 10-15

These shifts are locking, so the set invoked remains the working set until another set is invoked by a shift function.

Since the record begins with the G0 set (ISO 646) in columns 02‑07 and the G1 set in columns 10‑15, the shift functions to those sets will only be used when there has been an invocation of the G2 or G3 set into those columns. The G0 set must be the working set in columns 02‑07 at the end of each subfield and each field. The shift back to the G0 set when it has been temporarily displaced should occur before the subfield delimiter or end of field mark. The G1 set designated in field 100 is considered the default set for columns 10‑15; thus it should always be restored at the end of a field that has shifted another set into those columns.

In 8-bits, non-locking single shifts are not used in UNIMARC.
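
The requirement that a field ends with G0 in columns 02-07 and the default G1 set in columns 10-15 can be checked mechanically. The sketch below follows the locking shifts in the table above through the bytes of one field; the names `final_shift_state` and `ends_in_default_state` are hypothetical, and the per-subfield rule for G0 is not checked.

```python
ESC = 0x1B  # 01/11

# Shift bit combinations -> (set invoked, half of the matrix it is invoked into).
ONE_BYTE_SHIFTS = {
    0x0F: ("G0", "left"),   # LS0 = 00/15
    0x0E: ("G1", "left"),   # LS1 = 00/14
}
ESC_SHIFTS = {
    0x7E: ("G1", "right"),  # LS1R = ESC 7/14
    0x6E: ("G2", "left"),   # LS2  = ESC 6/14
    0x7D: ("G2", "right"),  # LS2R = ESC 7/13
    0x6F: ("G3", "left"),   # LS3  = ESC 6/15
    0x7C: ("G3", "right"),  # LS3R = ESC 7/12
}
DEFAULT_STATE = {"left": "G0", "right": "G1"}


def final_shift_state(field: bytes) -> dict[str, str]:
    """Return the sets invoked into columns 02-07 ('left') and 10-15 ('right') at field end."""
    state = dict(DEFAULT_STATE)          # every field starts from the default state
    i = 0
    while i < len(field):
        byte = field[i]
        if byte in ONE_BYTE_SHIFTS:
            name, half = ONE_BYTE_SHIFTS[byte]
            state[half] = name
        elif byte == ESC and i + 1 < len(field) and field[i + 1] in ESC_SHIFTS:
            name, half = ESC_SHIFTS[field[i + 1]]
            state[half] = name
            i += 1                       # skip the second byte of the shift
        i += 1
    return state


def ends_in_default_state(field: bytes) -> bool:
    """True if G0 and G1 are the working sets when the field is exited."""
    return final_shift_state(field) == DEFAULT_STATE
```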

Examples (for clarity, bit combinations are in bold)

EX 1: 500 11$aEdda Sæmundar.$mEnglish.$1 Selections.

The ISO 5426 Extended Latin set has been designated the G1 set. No shift is required to use it in the 8-bit environment.

EX 2: 500 11$aEdda S1/11 7/13æ1/11 7/14mundar.$mEnglish.$1Selections.

Shifts, in order of appearance: LS2R (1/11 7/13), LS1R (1/11 7/14).

The basic Cyrillic set has been designated the G1 set and the ISO 5426 Extended Latin set has been designated the G2 set. The G2 set is invoked to columns 10‑15 using the LS2R, displacing the default G1 set. Following the use of the G2 set, the G1 set is reinvoked into columns 10‑15.

EX 3: 210 ##$a1/11 7/13Москва$c"Правда1/11 7/14"$d1968

Shifts, in order of appearance: LS2R (1/11 7/13), LS1R (1/11 7/14).

ISO 5426 is the default G1 set and the basic Cyrillic set has been designated the G2 set. The G2 set is invoked into columns 10-15 when needed. Since the subfield code comes from the G0 set and it is still the column 02-07 working set at the end of the $a subfield, no shift need take place before the '$c'. The default G1 set is restored to columns 10-15, however, at the end of the use of the Cyrillic set in this field.

EX 4: 305 ##$aВпервые издано в С.Петербурге на нем. яз. в 1770-1784 в 4-х частях под заглавием "Reise durch Ru1/11 7/13ß1/11 7/14land zur Untersuchung der drey Natur-Reiche". Ч.4 на рус. яз. не переведена

Shifts, in order of appearance: LS2R (1/11 7/13), LS1R (1/11 7/14).

Basic Latin and Basic Cyrillic are the designated G0 and G1 sets, and Extended Latin the G2 set (100 $a/26-33 = 010203##). The Basic Latin and Cyrillic characters can be accessed without change to the settings. The German 'ss' character (ß) is found in the Extended Latin set, which is invoked into columns 10-15 by LS2R (ESC 7/13), temporarily displacing Basic Cyrillic. This is then restored by LS1R (ESC 7/14).

J.5 Additional Graphic sets

In some instances more than the four graphic sets designated in field 100 may be required in a UNIMARC record. Additional sets may be substituted for the sets designated in field 100 through an escape of the form 'ESC I F'. 'I', which may be one or more characters in length, indicates the Gn designation of the set according to the following values:

Single Byte per Character   Multiple Bytes per Character   Gn Designation
2/8 or 2/12                 2/4 2/8 or 2/4 2/12            G0
2/9 or 2/13                 2/4 2/9 or 2/4 2/13            G1
2/10 or 2/14                2/4 2/10 or 2/4 2/14           G2
2/11 or 2/15                2/4 2/11 or 2/4 2/15           G3

'F', the Final character, indicates the graphic set being designated. It is a bit combination from columns 4 to 7 that is assigned by ISO when the set is registered. The Final characters for the sets approved for use with UNIMARC are listed below. Final characters for other approved sets have not yet been assigned.

F      Graphic Set
4/0    ISO 646 (IRV), Basic Latin set
5/0    ISO 5426-1980, Extended Latin set
4/14   ISO Registration #37, Basic Cyrillic
5/1    ISO 5427-1984, Extended Cyrillic set
5/3    ISO 5428-1980, Greek set
4/13   ISO 6438-1983, African coded character set

If a fifth, etc., graphic set is needed in a UNIMARC field, it must first be designated through the escape sequence, then it may be invoked with shift functions as specified in Section J.4. When an additional set has been designated and invoked in a field, before the end of the field the original set specified in field 100 should be redesignated for the Gn via an escape sequence. When a field is exited, the G0, G1, G2, G3 designated sets must be those specified in field 100.

Example (for clarity, bit combinations are alternately bold and italic)

454 #0$1700#0$aXenophon.$150010$a1/11 2/9 5/3 1/11 7/14'Áπομνημονευματα1/11 2/9 5/0 1/11 7/14

Bit combinations, in order of appearance:

1/11 2/9 5/3   Designation of Greek set as G1
1/11 7/14      LS1R
1/11 2/9 5/0   Redesignation of Extended Latin set as G1 set
1/11 7/14      LS1R

The record is for a Bulgarian translation of a Greek work and the language of cataloguing is English. The agency has designated in field 100 the following sets:

G0   ISO 646, Basic Latin
G1   ISO 5426, Extended Latin
G2   ISO Registration #37, Basic Cyrillic
G3   ISO DIS 5427, Extended Cyrillic

When the Greek set is needed in the 454 field to give the original title in Greek, it is designated as the G1 set via the sequence ESC 2/9 5/3 and then invoked into columns 10‑15 via the sequence ESC 7/14. Before exiting the field, the Extended Latin set is restored to the G1 designation via ESC 2/9 5/0 and it is reinvoked into columns 10‑15 via ESC 7/14.
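
A designation sequence of the form 'ESC I F' can be interpreted with two lookup tables built from the values given in this section. The sketch below is illustrative only; the function name `parse_designation` is hypothetical, and only the Final characters listed in this section are recognised.

```python
ESC = 0x1B  # 1/11

# Intermediate character -> Gn designation (2/8 ... 2/15).
INTERMEDIATES = {
    0x28: "G0", 0x2C: "G0",
    0x29: "G1", 0x2D: "G1",
    0x2A: "G2", 0x2E: "G2",
    0x2B: "G3", 0x2F: "G3",
}
MULTI_BYTE_PREFIX = 0x24  # 2/4 precedes the intermediate for multiple-byte sets

# Final character -> graphic set.
FINALS = {
    0x40: "ISO 646 (IRV), Basic Latin",                   # 4/0
    0x50: "ISO 5426-1980, Extended Latin",                # 5/0
    0x4E: "ISO Registration #37, Basic Cyrillic",         # 4/14
    0x51: "ISO 5427-1984, Extended Cyrillic",             # 5/1
    0x53: "ISO 5428-1980, Greek",                         # 5/3
    0x4D: "ISO 6438-1983, African coded character set",   # 4/13
}


def parse_designation(seq: bytes) -> tuple[str, str, bool]:
    """Return (Gn, set name, is_multi_byte) for one complete 'ESC I F' sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not an ESC I F sequence")
    body = seq[1:]
    multi_byte = body[0] == MULTI_BYTE_PREFIX
    if multi_byte:
        body = body[1:]
    if len(body) != 2 or body[0] not in INTERMEDIATES:
        raise ValueError("unrecognised intermediate character(s)")
    if body[1] not in FINALS:
        raise ValueError("Final character is not in the list for UNIMARC")
    return INTERMEDIATES[body[0]], FINALS[body[1]], multi_byte
```

Applied to the example, ESC 2/9 5/3 is read as the Greek set designated as G1, and ESC 2/9 5/0 as the Extended Latin set restored to G1.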

J.6 ISO 10646 character set

ISO 10646, being a 16-bit character set, contains all necessary characters. This will be used for the C0, C1 and all G sets.

J.7 Character set tables

Sections J.8 through J.10 contain the code tables for some of the character sets specified for use in UNIMARC records. These character sets are reproduced with the permission of the International Organization for Standardization (ISO). Copies of the complete standards can be obtained from the ISO Central Secretariat, Case postale 56, 1211 GENEVA 20, Switzerland, and from any ISO Member Body.

J.8 Basic Control Set – ISO 646 (IRV)

This control set is the C0 set for UNIMARC records.

The following positions are the only ones to be used in UNIMARC:

Position   Acronym   Name
0/14       SO        Shift Out
0/15       SI        Shift In
1/11       ESC       Escape
1/13       IS3       Information Separator Three
1/14       IS2       Information Separator Two
1/15       IS1       Information Separator One

In this Manual, the symbols for the Information Separators are:

IS1   $   (Subfield delimiter)
IS2   @   (Field separator; in most examples the end of field mark is not shown)
IS3   %   (Record terminator)
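
The relationship between the separator bytes and the symbols used in this Manual can be sketched as follows. The helper names `to_manual_notation` and `split_subfields` are hypothetical. The sketch assumes ISO 646 data only, and that each subfield begins with IS1 followed by a one-character subfield code.

```python
IS1 = 0x1F  # 1/15, subfield delimiter, shown as $
IS2 = 0x1E  # 1/14, field separator, shown as @
IS3 = 0x1D  # 1/13, record terminator, shown as %

SYMBOLS = {IS1: "$", IS2: "@", IS3: "%"}


def to_manual_notation(data: bytes) -> str:
    """Render ISO 646 data with the separators replaced by the Manual's symbols."""
    return "".join(SYMBOLS.get(byte, chr(byte)) for byte in data)


def split_subfields(field: bytes) -> list[tuple[str, bytes]]:
    """Split the data of one field into (subfield code, value) pairs."""
    field = field.rstrip(bytes([IS2]))        # drop the end of field mark, if present
    pairs = []
    # Whatever precedes the first IS1 (the indicators) is skipped.
    for chunk in field.split(bytes([IS1]))[1:]:
        if chunk:
            pairs.append((chr(chunk[0]), chunk[1:]))
    return pairs
```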

J.9 Bibliographic Control Set – ISO 6630: 1986

This control set contains control functions required for filing, sorting, permuting, etc. It is the C1 set for UNIMARC records. The following positions are the only ones to be used in UNIMARC:

Position   Acronym   Name
08/08      NSB       Non-Sorting Character(s), Beginning
08/09      NSE       Non-Sorting Character(s), End
08/11      PLD       Partial Line Down
08/12      PLU       Partial Line Up

In this Manual, the symbols for the non-sorting characters are:

NSB   ¹NSB¹
NSE   ¹NSE¹

PLU is used both to produce superscript text and to restore to the previous position subscript text created by the use of PLD. The reverse is also true, as is shown in the following example:

2³+3² is expressed as 2¹PLU¹3¹PLD¹+3¹PLU¹2¹PLD¹
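
The way PLU and PLD cancel each other can be modelled as a running vertical level. This sketch assumes an 8-bit record, in which the functions are single bytes at positions 08/12 and 08/11; the function name `vertical_levels` is hypothetical.

```python
PLD = 0x8B  # 08/11, Partial Line Down
PLU = 0x8C  # 08/12, Partial Line Up


def vertical_levels(data: bytes) -> list[tuple[str, int]]:
    """Return (character, level) pairs: 0 is the base line, +1 superscript, -1 subscript."""
    level = 0
    result = []
    for byte in data:
        if byte == PLU:
            level += 1      # move up half a line
        elif byte == PLD:
            level -= 1      # move down half a line
        else:
            result.append((chr(byte), level))
    return result
```

For the expression above, the digits following each PLU come out at level +1 and every other character at level 0.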

J.10 Basic Latin Set – ISO 646 (IRV)

This graphic set is specified in ISO 646. It is the default G0 set for UNIMARC records.

Position   Name                  Position   Name
2/0        Space, Blank          5/0        Capital Letter P
2/1        Exclamation Mark      5/1        Capital Letter Q
2/2        Quotation Mark        5/2        Capital Letter R
2/3        Number Sign           5/3        Capital Letter S
2/4        Dollar Sign           5/4        Capital Letter T
2/5        Per Cent Sign         5/5        Capital Letter U
2/6        Ampersand             5/6        Capital Letter V
2/7        Apostrophe            5/7        Capital Letter W
2/8        Left Parenthesis      5/8        Capital Letter X
2/9        Right Parenthesis     5/9        Capital Letter Y
2/10       Asterisk              5/10       Capital Letter Z
2/11       Plus Sign             5/11       Left Square Bracket
2/12       Comma                 5/12       Reverse Solidus
2/13       Hyphen, Minus Sign    5/13       Right Square Bracket
2/14       Full Stop, Period     5/14       Circumflex Accent
2/15       Solidus               5/15       Underline

3/0        Digit Zero            6/0        Grave Accent
3/1        Digit One             6/1        Small Letter a
3/2        Digit Two             6/2        Small Letter b
3/3        Digit Three           6/3        Small Letter c
3/4        Digit Four            6/4        Small Letter d
3/5        Digit Five            6/5        Small Letter e
3/6        Digit Six             6/6        Small Letter f
3/7        Digit Seven           6/7        Small Letter g
3/8        Digit Eight           6/8        Small Letter h
3/9        Digit Nine            6/9        Small Letter i
3/10       Colon                 6/10       Small Letter j
3/11       Semi-colon            6/11       Small Letter k
3/12       Less than Sign        6/12       Small Letter l
3/13       Equals Sign           6/13       Small Letter m
3/14       Greater than Sign     6/14       Small Letter n
3/15       Question Mark         6/15       Small Letter o

4/0        Commercial At         7/0        Small Letter p
4/1        Capital Letter A      7/1        Small Letter q
4/2        Capital Letter B      7/2        Small Letter r
4/3        Capital Letter C      7/3        Small Letter s
4/4        Capital Letter D      7/4        Small Letter t
4/5        Capital Letter E      7/5        Small Letter u
4/6        Capital Letter F      7/6        Small Letter v
4/7        Capital Letter G      7/7        Small Letter w
4/8        Capital Letter H      7/8        Small Letter x
4/9        Capital Letter I      7/9        Small Letter y
4/10       Capital Letter J      7/10       Small Letter z
4/11       Capital Letter K      7/11       Left Curly Bracket
4/12       Capital Letter L      7/12       Vertical Line
4/13       Capital Letter M      7/13       Right Curly Bracket
4/14       Capital Letter N      7/14       Tilde
4/15       Capital Letter O

N.B. If this set is used in combination with ISO 5426, positions 5/15, 6/0 and 7/14 in ISO 646 should not be used. Positions 5/8, 4/1 and 4/5 in ISO 5426 should be used instead.

See also

UNIMARC appendices
Appendix C: character sets // UNIMARC Bibliographic Format Manual. IFLA, 2023. No 1.0.0. P. 749–755. (in English)
Appendix J Character sets 729-737 pp., UNIMARC Bibliographic, IFLA, 2008 (in English)

Annexe J – Jeux de caractères, Bibliographic Transition in France (in French)
