Horizontal, vertical and other formats (2024)


Horizontal, vertical and other formats

We must, for practical as well as definitional reasons, restrict our attention to corpora considered as collections of texts or textual samples of language. Texts are linear; syntactic structures, on the other hand, are often represented in two-dimensional terms, especially as tree structures, or (in greater detail) as tree structures, the nodes of which are sets of attributes and values. As far as syntactic annotation is concerned, we are interested only in how these two- or multi-dimensional structures are represented in relation to the linearity of texts.

There are two commonly used linear formats for storing, inputting and outputting text data: horizontal and vertical. It is possible to represent a syntactically annotated text in either of these formats without changing the nature of the annotation. Converting a horizontal format to a vertical one, or vice versa, is a relatively trivial operation if undertaken automatically. From the user's point of view, however, the difference between the two formats is certainly not trivial, as it may make the difference between an intelligible and an unintelligible presentation. We will use examples from some corpora to illustrate this.

**Table 1:** Horizontal format
[N The_AT door_NN1 ,_, [Fr [N which_DDQ N] [V was_VBDZ
equipped_VVN [P with_IW [N neither_LE [ bell_NN1 nor_CC
knocker_NN1 ] N] P] V] Fr] N] ,_,
[V was_VBDZ [ blistered_VVN and_CC distained_VVN ] V] ._.

**Table 2:** Vertical format
The	AT	[N
door	NN1
,	,
which	DDQ	[Fr[N]
was	VBDZ	[V
equipped	VVN
with	IW	[P
neither	LE	[N
bell	NN1	[
nor	CC
knocker	NN1	]N]P]V]Fr]N]
,	,
was	VBDZ	[V
blistered	VVN
and	CC
distained	VVN	V]
.	.
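The claim that conversion between the two formats is mechanical can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of any corpus specification: every bracket and every `word_tag` pair in the horizontal text is separated by whitespace, opening brackets are attached to the following word, closing brackets to the preceding word, and the bracket column keeps its tokens separated by spaces (Table 2 writes them without spaces). The function names are hypothetical.

```python
def horizontal_to_vertical(text):
    """Turn a bracketed word_tag string into tab-separated lines.

    Assumes whitespace-separated tokens: '[X' opens, 'X]' closes,
    anything else is a word_tag pair.
    """
    rows = []           # each row: [word, tag, list_of_bracket_tokens]
    pending_open = []   # opening brackets waiting for the next word
    for tok in text.split():
        if tok.startswith("["):
            pending_open.append(tok)
        elif tok.endswith("]"):
            if not rows:
                raise ValueError("closing bracket before any word")
            rows[-1][2].append(tok)
        else:
            # Split on the last underscore so that ',_,' gives (',', ',').
            word, _, tag = tok.rpartition("_")
            rows.append([word, tag, pending_open])
            pending_open = []
    lines = []
    for word, tag, brackets in rows:
        fields = [word, tag]
        if brackets:
            fields.append(" ".join(brackets))
        lines.append("\t".join(fields))
    return lines


def vertical_to_horizontal(lines):
    """Inverse of horizontal_to_vertical under the same assumptions."""
    out = []
    for line in lines:
        fields = line.split("\t")
        word, tag = fields[0], fields[1]
        brackets = fields[2].split() if len(fields) > 2 else []
        out.extend(b for b in brackets if b.startswith("["))
        out.append(f"{word}_{tag}")
        out.extend(b for b in brackets if not b.startswith("["))
    return " ".join(out)


# A fragment of the sentence in Table 1.
SAMPLE = ("[N which_DDQ N] [V was_VBDZ equipped_VVN [P with_IW "
          "[N neither_LE [ bell_NN1 nor_CC knocker_NN1 ] N] P] V]")

vertical = horizontal_to_vertical(SAMPLE)
for line in vertical:
    print(line)
```

Because no information is lost in either direction, `vertical_to_horizontal(vertical)` returns the original string; the choice between the formats is therefore one of presentation rather than content.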

Table 3 is an example in horizontal format from the IBM Paris Treebank (Langé 1994).

**Table 3:** Horizontal format: IBM Paris Treebank
[N Ce_DDEMMS guide_NCOMS N] [V [P leur_PPCA6MP P] permet_VINIP3
[P de_PREPD [Vi se_PPRE6MP familiariser_VPRN [P avec_PREP
[N les_DARDFP opérations_NCOFP [P de_PREPD [N réseau_NCOMS
[A local_AJQMS A] N] P] [A effectuées_VTRPSFP [P par_PREP
[N les_DARDMP utilisateurs_NCOMP N] P] A] N] P] Vi] P] V] ._.

The horizontal format is more compact, and is easier to read so long as the amount of syntactic information interspersed with the words is not too dense. The vertical format is more convenient and more readable when there is too much syntactic information to be shown conveniently in the horizontal format. Moreover, the vertical format lends itself to a number of parallel fields of information, so that (for example) the actual orthographic text (as a sequence of word forms and punctuation marks) can be separated out from the sequence of morphosyntactic tags, and both of these separated from the representation of a phrase structure tree. Other fields may contain corpus location references, or deep syntactic information (such as ellipsis) held in a field separate from the surface syntactic information. Table 4 is an example from the SUSANNE corpus (Sampson 1995), which gives an impression of the various aligned information types that can be given. The columns (i.e. fields) contain the following information:

Field 1: text references
Field 2: part-of-speech tags
Field 3: the text words
Field 4: base forms (lemmatised forms of the words in Field 3; e.g. `said' is lemmatised as `say')
Field 5: syntactic annotation (brackets and labels)

**Table 4:** Vertical format: SUSANNE
A01:0010a	YB	<minbrk>		[Oh.Oh]
A01:0010b	AT	The	the	[O[S[Nns:s.
A01:0010c	NP1s	Fulton	Fulton	[Nns.
A01:0010d	NNL1cb	County	county	.Nns]
A01:0010e	JJ	Grand	grand	.
A01:0010f	NN1c	Jury	jury	.Nns:s]
A01:0010g	VVDv	said	say	[Vd.Vd]
A01:0010h	NPD1	Friday	Friday	[Nns:t.Nns:t]
A01:0010i	AT1	an	an	[Fn:o[Ns:s.
A01:0010j	NN1n	investigation	investigation	.
A01:0020a	IO	of	of	[Po.
A01:0020b	NP1t	Atlanta	Atlanta	[Ns[G[Nns.Nns]
A01:0020c	GG	+<apos>s	-	.G]
A01:0020d	JJ	recent	recent	.
A01:0020e	JJ	primary	primary	.
A01:0020f	NN1n	election	election	.Ns]Po]Ns:s]
A01:0020g	VVDv	produced	produce	[Vd.Vd]
A01:0020h	YIL	<ldquo>	-	.
A01:0020i	ATn	+no	no	[Ns:o.
A01:0020j	NN1u	evidence	evidence	.
A01:0020k	YIR	+<rdquo>	-	.
A01:0020m	CST	that	that	[Fn.
A01:0030a	DDy	any	any	[Np:s.
A01:0030b	NN2	irregularities	irregularity	.Np:s]
A01:0030c	VVDv	took	take	[Vd.Vd]
A01:0030d	NNL1c	place	place	[Ns:o.Ns:o]Fn]Ns:o]Fn:o]S]
A01:0030e	YF	+.	-	.O]
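The way parallel fields allow the word sequence, the tag sequence and the parse to be separated can be sketched in Python as follows. This is an illustrative reader only, assuming tab-separated lines with exactly the five fields listed above; the `Record` type and `parse_vertical` function are hypothetical names, not part of any SUSANNE tooling.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One line of a five-field vertical file (field order as in Table 4)."""
    reference: str
    tag: str
    word: str
    lemma: str
    parse: str


def parse_vertical(text):
    """Split tab-separated vertical text into Record tuples."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 fields, got {len(fields)}")
        records.append(Record(*fields))
    return records


# Four lines taken from Table 4.
SAMPLE = "\n".join([
    "A01:0010g\tVVDv\tsaid\tsay\t[Vd.Vd]",
    "A01:0010h\tNPD1\tFriday\tFriday\t[Nns:t.Nns:t]",
    "A01:0010i\tAT1\tan\tan\t[Fn:o[Ns:s.",
    "A01:0010j\tNN1n\tinvestigation\tinvestigation\t.",
])

records = parse_vertical(SAMPLE)
words = [r.word for r in records]    # the orthographic text
tags = [r.tag for r in records]      # the morphosyntactic tags
parses = [r.parse for r in records]  # the bracketing field
print(" ".join(words))
```

Each projection (`words`, `tags`, `parses`) stays aligned with the others by position, which is the property that makes the vertical format convenient for several kinds of annotation at once.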

The field that indicates the structure of the sentence can be made more graphically explicit by the use of indentation. The example from TOSCA in Table 5 illustrates this. The first level is the utterance, the second level comprises NP, VP and PP, and so on. (This indented format is in fact an intermediate structure, the final output being represented as a tree on the screen.)

**Table 5:** Indented format: TOSCA
-:TXTU()
  UTT:S(act,indic,inter,motr,pres,unm)
    INTOP:AUX(do,indic,pres){Does}
    SU:NP()
      NPHD:PN(pers,sing){he}
    V:VP(act,do,indic,motr)
      MVB:LV(indic,infin,motr){realise}
    OD:CL(act,indic,intens,pres,unm,zsub)
      SU:NP()
        NPHD:PN(pers,sing){he}
      V:VP(act,indic,intens,pres)
        MVB:LV(indic,intens,pres){is}
      CS:AJP(prd)
        AJHD:ADJ(prd){wrong}
  PUNC:PM(qm){?}
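The general idea of expressing depth of embedding through indentation can be sketched in Python by re-displaying a labelled bracketing of the kind shown in Table 1. This is not the TOSCA format or software: the function name, the two-space indent step and the `(unlabelled)` placeholder are assumptions made for illustration, and the input is assumed to have whitespace around every bracket.

```python
def indent_bracketing(text, step=2):
    """Return one line per constituent label or word, indented by depth."""
    lines = []
    depth = 0
    for tok in text.split():
        if tok.startswith("["):
            # An opening bracket starts a constituent one level deeper.
            label = tok[1:] or "(unlabelled)"
            lines.append(" " * (depth * step) + label)
            depth += 1
        elif tok.endswith("]"):
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without matching opening bracket")
        else:
            # A word_tag pair is shown at the depth of its constituent.
            lines.append(" " * (depth * step) + tok)
    if depth != 0:
        raise ValueError("unclosed opening bracket")
    return lines


# A fragment of the sentence in Table 1.
SAMPLE = ("[V was_VBDZ equipped_VVN [P with_IW "
          "[N neither_LE [ bell_NN1 nor_CC knocker_NN1 ] N] P] V]")

for line in indent_bracketing(SAMPLE):
    print(line)
```

The indented listing carries the same information as the bracketing; only the closing labels become implicit, since the end of a constituent is signalled by a return to a shallower indent.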
