Corpus linguistics = the study of linguistic phenomena through data obtained from a corpus. The leading figure of this discipline is Douglas Biber.
Noam Chomsky criticized corpus linguistics, arguing that one can never have a corpus of all utterances and, most importantly, that corpora miss common everyday English, because even the spoken sections come mainly from television, where people tend to speak rather formally.
A corpus (pl. corpora) is a collection of texts assumed to be representative of a given language, so that it can be used for linguistic analysis. The texts must be available in machine-readable form. All modern dictionaries are based on corpora, in which words can be traced in terms of frequency, collocations, and set phrases. Before corpora, language studies relied on the subjective approach of the lexicographer.
Corpus-based analysis:
is empirical, analyzing the actual patterns of language use
utilizes large collections of natural texts
uses computers, which make it possible to analyze complex patterns of language use, allowing the storage and analysis of a larger database than could be dealt with manually
depends on both quantitative and qualitative analytical techniques. Frequency data reveal how often a certain pattern occurs relative to other patterns. Going beyond the quantitative patterns means proposing functional interpretations that explain why the patterns exist. It is not only about computing data but primarily about the interpretation.
The study considers association patterns which represent
quantitative relations, measuring the extent to which features are associated
with contextual factors.
Linguistic associations
lexical associations - associations with
particular words
grammatical associations - associations
with particular grammatical constructions
lexical-grammatical associations - taking a grammatical feature and considering its lexical associations
Non-linguistic associations
distribution across registers - varieties defined by situation
distribution across dialects - varieties defined by a social group
distribution across time periods
Creation of corpus
1. Convert the text to plain .TXT format to remove any formatting.
2. POS tagging – label each word according to its part of speech; a tagged word becomes a token (see the sketch after this list).
3. Annotation – add interpretative linguistic information to the corpus.
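A minimal sketch of step 2 in Python, assuming the NLTK library (the notes do not name a specific tool); the sample sentence is invented, and a real project would tag whole corpus files rather than one string.

```python
# Minimal POS-tagging sketch with NLTK (assumed library; sample text invented).
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

raw = "Corpus linguistics studies language through large text collections."

tokens = nltk.word_tokenize(raw)   # split the plain text into tokens
tagged = nltk.pos_tag(tokens)      # step 2: label each token with its POS

print(tagged)
# e.g. [('Corpus', 'NN'), ('linguistics', 'NN'), ('studies', 'VBZ'), ...]
```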
Examples
of available corpora
The very first corpus was created by the Italian monk Roberto Busa on the works of Thomas Aquinas. He recorded the words on paper cards, so they were not machine-processable. The Brown Corpus was the first corpus on punch cards, compiled in 1964 by Henry Kučera and W. Nelson Francis at Brown University; it contained one million words. The British National Corpus (1980-1993) has 100 million words, was made at Oxford University, and consists of 90% written and 10% spoken material. The Corpus of Contemporary American English (COCA, 1990-2012) has 450 million words and was created at Brigham Young University, a Mormon institution. The International Corpus of English is a project on varieties of English. The COBUILD corpus, called the Bank of English, offers 450 million words but is not free.
Other corpora: ICAME, the Bergen Corpus of London Teenage Language, the Corpus of Middle English Prose and Verse, the Penn-Helsinki Parsed Corpus of Middle English, CHILDES (child language data), the London-Lund Corpus (contains spontaneous and prepared speech), the Lancaster-Oslo/Bergen Corpus (written British English, a counterpart of the Brown Corpus), and the Longman-Lancaster Corpus (only academic prose and fiction).
Why use a corpus instead of a search engine like Google? Google cannot look at differences between registers of English, changes over time, the frequency of collocations, or possible compounds, and it cannot find synonyms or words related by a semantic field.
Types of corpus analysis
Lexicological analysis is concerned with the meaning and use of words and is central to dictionary making. Corpus-based lexicological investigations address these questions:
What are
the meanings associated with a particular word?
identify meanings by
looking at occurrences in contexts, rather than relying on intuition.
One of the advantages of corpus-based research is that the corpus can be used to show all the contexts in which a word occurs, making it possible to identify the different meanings associated with the word. Programs called concordancers can display the occurrences of a chosen word with its surrounding context. Such displays are called concordance listings: each occurrence of the chosen word is presented on a single line, with the word in the middle and context on each side. These displays are referred to as KWIC - Key Word in Context. The list can be in alphabetical order, in order of occurrence, or in order of frequency.
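A toy concordancer illustrating the KWIC display described above; the corpus string, node word, and window width are invented for the example.

```python
# KWIC sketch: one hit per line, key word in the middle, context on each side.
import re

def kwic(text, keyword, width=25):
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(keyword, tok, flags=re.IGNORECASE):
            left = " ".join(tokens[:i])[-width:]        # left context, trimmed
            right = " ".join(tokens[i + 1:])[:width]    # right context, trimmed
            lines.append(f"{left:>{width}}  {tok.upper()}  {right}")
    return lines

corpus = ("The bank raised rates . She sat on the river bank . "
          "A bank holiday is coming . He robbed the bank yesterday .")

for line in kwic(corpus, "bank"):   # shown here in order of occurrence
    print(line)
```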
What is
the frequency of a word relative to other related words?
Knowing relative frequencies allows us to identify common and uncommon words, which is especially useful in designing teaching materials for language students.
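A sketch of such a frequency count over a tiny invented sample; real studies would run this over millions of words.

```python
# Frequency list sketch: separating common from uncommon words.
from collections import Counter

text = "the cat sat on the mat and the dog sat near the cat"
freq = Counter(text.lower().split())

for word, n in freq.most_common(5):
    print(f"{word}\t{n}")
# the 4, cat 2, sat 2, ...
```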
What collocations
does a particular word have?
What words commonly co-occur with a particular word? These co-occurring words are its collocations. Left collocates are those immediately preceding the target word; right collocates are those immediately following it.
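A sketch of collecting left and right collocates with a one-word window on each side; the sample sentences and node word are invented.

```python
# Collocate sketch: count immediate left and right neighbours of a node word.
from collections import Counter

tokens = ("strong tea tastes good . he drinks strong coffee . "
          "she made strong tea again").split()

node = "strong"
left, right = Counter(), Counter()
for i, tok in enumerate(tokens):
    if tok == node:
        if i > 0:
            left[tokens[i - 1]] += 1       # word immediately preceding
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1      # word immediately following

print("left collocates:", left.most_common())
print("right collocates:", right.most_common())   # tea occurs twice after 'strong'
```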
What is a
word's distribution across registers?
Raw counts show the actual number of occurrences of the word in each register, but a comparison of raw counts cannot be used to conclude that a word is more common in one register than another. We instead need a measure of how often a reader will come across a particular word -> normed counts convert the number of occurrences of a word to a standard scale, in this case per 100,000 words.
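The norming step as a small sketch; the counts and register sizes below are invented, chosen to show why raw and normed figures can point in opposite directions.

```python
# Norming sketch: convert raw counts to a rate per 100,000 words.
def per_100k(raw_count, corpus_size):
    return raw_count / corpus_size * 100_000

registers = {                # (raw occurrences, total words) - invented figures
    "conversation": (50, 400_000),
    "academic":     (90, 2_000_000),
}
for name, (hits, size) in registers.items():
    print(f"{name}: {per_100k(hits, size):.1f} per 100,000 words")
# academic has more raw hits (90 vs 50) but the lower normed rate (4.5 vs 12.5)
```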
How are
synonymous words used in different ways?
Many words that are considered synonymous do not have totally identical meanings and are used differently, like the seemingly synonymous big, large, and great.
A lemma is the base form of a word, disregarding grammatical changes such as tense and plurality; it is a dictionary entry which can take various forms (work - works, working, worked). The node is the word under investigation in a concordance, the item around which collocates are counted.
Grammatical
analysis - the study of
grammar is often seen as a prescriptive discipline (what is absolutely correct according to grammarians); however, grammatical studies are rather descriptive (what native speakers consider acceptable). Corpus-based research can be applied to grammar at the sentence level and the discourse level. Studying a morphological characteristic in a corpus can teach us about both its frequency and its distribution.
Nominalizations are nouns derived from a verb or an adjective. We find that academic prose uses nominalizations to treat actions and processes as separate from human participants, whereas fiction and spoken discourse are more often concerned with people and use verbs and adjectives to describe how they behave. Counting grammatical categories is not an easy task, since there is the question of how to identify them: for example, how to deal with pronouns, which often take the place of nouns, or whether auxiliary verbs should be included in the overall verb count, since they provide no lexical content. A crude counting sketch follows.
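The crude sketch mentioned above counts candidate nominalizations by derivational suffix; the suffix list and sample sentences are simplifying assumptions, and a POS-tagged corpus would be far more reliable.

```python
# Rough nominalization count via suffixes (-tion, -ment, -ness, -ity).
import re

NOMINAL_SUFFIX = re.compile(r"\w+(?:tion|ment|ness|ity)s?\b", re.IGNORECASE)

samples = {
    "academic": "The examination of the development showed considerable complexity.",
    "fiction":  "She walked out and he watched her go quietly.",
}
for register, text in samples.items():
    print(register, len(NOMINAL_SUFFIX.findall(text)))
# academic 3 (examination, development, complexity); fiction 0
```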
Syntactic constructions - we can investigate, for example, complement clauses, which complete the meaning of a verb; two common types are that-clauses and to-clauses. Across registers, that-clauses are very common in conversation but not so common in academic prose. In contrast, to-clauses are moderately common in both conversation and academic prose. A rough sketch of detecting the two types follows.
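One rough way to detect the two clause types, assuming spaCy and using its ccomp/xcomp dependency labels as a proxy for that-clauses and to-clauses respectively; the mapping is an approximation and the sentences are invented.

```python
# Sketch: complement-clause detection via dependency labels (spaCy assumed).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I think that she left. She wants to leave soon.")

for tok in doc:
    if tok.dep_ in ("ccomp", "xcomp"):   # ccomp ~ that-clause, xcomp ~ to-clause
        print(tok.head.text, "->", tok.dep_, ":", tok.text)
# e.g. think -> ccomp : left
#      wants -> xcomp : leave
```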
Lexico-grammatical analysis is particularly useful when we are attempting to distinguish between words or structures that are nearly synonymous in meaning. Verbs can be grouped according to their valence = their potential for combining with other clause elements. Transitive pattern - with a noun phrase as direct object. Intransitive pattern - no object. Copula pattern - linking the subject to a subject predicative. One or more adverbials can be freely added to all the patterns.
For example, we can analyze the nearly synonymous words begin and start. Despite their similarities, these two verbs are typically used in very different structures: corpora show that start is more commonly used as an intransitive verb, while begin is much more commonly used as a transitive verb (a rough corpus check is sketched below).
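The rough check mentioned above, assuming spaCy: each begin/start token is classified by whether it governs a direct object. The four-sentence sample is invented; a real study would run over a full corpus.

```python
# Sketch: transitive vs intransitive uses of begin/start (spaCy assumed).
import spacy

nlp = spacy.load("en_core_web_sm")
sample = ("The show started at nine. He started the car. "
          "She began a new chapter. The rain began.")

counts = {}
for tok in nlp(sample):
    if tok.lemma_ in ("begin", "start") and tok.pos_ == "VERB":
        has_object = any(child.dep_ == "dobj" for child in tok.children)
        pattern = "transitive" if has_object else "intransitive"
        counts[(tok.lemma_, pattern)] = counts.get((tok.lemma_, pattern), 0) + 1

for (verb, pattern), n in sorted(counts.items()):
    print(verb, pattern, n)
```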
Multidimensional
analysis includes multiple
texts from a wide range of spoken and written registers. The goal is to include
a wide range of the linguistic features that have functional associations: tense
and aspect, place and time adverbials, pronouns, pro-verbs, questions, nominal forms, passives and actives, subordination and coordination features, prepositional phrases, adjectives, adverbs, modals, and negation.
The analyst is faced with an overwhelming amount of data, so a statistical procedure known as factor analysis can be used to show which of the linguistic features tend to co-occur in texts. Each set of co-occurring features is called a dimension of variation. It is then possible to compute dimension scores for each text and to compare texts and registers (a minimal sketch follows).
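A minimal sketch of the factor-analysis step, assuming scikit-learn and a random stand-in matrix of feature rates (rows = texts, columns = features); real studies use normed counts of the features listed above.

```python
# Factor-analysis sketch: co-varying features load on shared dimensions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))        # 40 texts x 8 feature rates (stand-in data)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)        # dimension scores, one row per text

print(fa.components_.shape)         # (2, 8): feature loadings per dimension
print(scores[:3])                   # scores for the first three texts
```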
Linguists have also long been interested in the development of student writing. Two popular measures of student writing development have been the number of words per text and the average length of T-units in a text (a T-unit = an independent clause with all its dependent clauses), used to gauge syntactic complexity. Researchers used to compare student writing across grade levels in terms of overall essay length and average T-unit length; the ability to write longer essays and longer T-units was taken as a sign of increased proficiency. Previous studies focused on a small number of writers, but a multidimensional approach can analyze variation along several dimensions. A toy calculation is sketched below.
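The toy calculation mentioned above, assuming the T-units have already been segmented by hand (automatic T-unit segmentation is a harder problem); the sentences are invented.

```python
# Sketch: words per text and mean T-unit length as proficiency measures.
t_units = [
    "The essay argues that corpora changed lexicography",
    "dictionaries once relied on the intuition of a single editor",
    "modern entries cite frequency and collocation evidence",
]

total_words = sum(len(t.split()) for t in t_units)
mean_t_unit_length = total_words / len(t_units)

print("words per text:", total_words)                 # 24
print("mean T-unit length:", mean_t_unit_length)      # 8.0
```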
Corpus-based techniques are also useful for investigating second-language acquisition. Errors produced by second-language students have been discussed from a variety of perspectives, such as the gravity of errors and the nature of errors as interlanguage.