11 August 2014

Corpus linguistics

Corpus linguistics = the study of linguistic phenomena through data obtained from a corpus. The leading figure of the discipline is Douglas Biber. Noam Chomsky criticized corpus linguistics, arguing that no corpus can ever capture all utterances and, most importantly, common everyday English, because even the spoken sections come mainly from television, where people tend to speak rather formally.


A corpus (pl. corpora) is a collection of texts assumed to be representative of a given language, so that it can be used for linguistic analysis. The texts must be available in machine-readable form. All modern dictionaries are based on corpora, in which words can be traced in terms of frequency, collocations and set phrases. Before corpora, dictionary making relied on the subjective judgment of the lexicographer.

Corpus-based analysis:
*      is empirical, analyzing the actual patterns of language use
*      utilizes large collections of natural texts
*      uses computers, which make it possible to analyze complex patterns of language use and allow the storage and analysis of a larger database than could be handled manually
*      depends on both quantitative and qualitative analytical techniques. Frequency data reveal how often a certain pattern occurs relative to other patterns. Going beyond the quantitative patterns means proposing functional interpretations that explain why the patterns exist. It is not only about computing data but primarily about interpretation.

The study considers association patterns, which represent quantitative relations, measuring the extent to which linguistic features are associated with contextual factors.
Linguistic associations
*      lexical associations - associations with particular words
*      grammatical associations - associations with particular grammatical constructions
*      lexical-grammatical associations - taking a grammatical feature and considering its lexical associations
Non-linguistic associations
*      distribution across registers - varieties defined by situation
*      distribution across dialects - varieties defined by a social group
*      distribution across time periods

Creation of a corpus
1.       Convert the text to plain .TXT to strip any formatting.
2.       POS tagging - label each word according to its part of speech; a tagged word becomes a token (a tagging sketch follows below).
3.       Annotation - add interpretative linguistic information to the corpus.
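As a rough illustration, here is a minimal POS-tagging sketch in Python using the NLTK library (the sentence is invented, and NLTK's default English tokenizer and tagger models are assumed to be downloadable):

    import nltk

    nltk.download("punkt", quiet=True)                        # word tokenizer model
    nltk.download("averaged_perceptron_tagger", quiet=True)   # default English POS tagger

    text = "The corpus reveals how speakers actually use language."
    tokens = nltk.word_tokenize(text)   # step 1: split the raw text into tokens
    tagged = nltk.pos_tag(tokens)       # step 2: label each token with its part of speech
    print(tagged)   # e.g. [('The', 'DT'), ('corpus', 'NN'), ('reveals', 'VBZ'), ...]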

Examples of available corpora
The very first corpus was created by the Italian monk Roberto Busa on the works of Thomas Aquinas. He wrote the words on paper cards, so they were not machine-processable. The Brown Corpus, compiled in 1964 by Henry Kučera and W. Nelson Francis at Brown University, was the first corpus on punch cards and contained one million words.
The British National Corpus (texts from 1980-1993) has 100 million words; it was made at Oxford University and consists of 90% written and 10% spoken material. The Corpus of Contemporary American English (COCA, 1990-2012) has 450 million words and was developed at Brigham Young University. The International Corpus of English is a project on varieties of English. COBUILD, also known as the Bank of English, offers 450 million words but is not free.
Other corpora: ICAME, Bergen Corpus of London Teenage Language, Corpus of Middle English Prose or Verse, The Penn-Helsinki Parsed Corpus of Middle English, CHILDES (child language data), London-Lund Corpus (contains spontaneous and prepared speech), Lancaster-Oslo/Bergen Corpus (also spoken language), Longman-Lancaster Corpus (only academic prose and fiction).
Why use a corpus instead of a search engine like Google? A search engine cannot look at differences between registers of English, changes over time, frequencies of collocations or possible compounds, and it cannot find synonyms or words related by a semantic field.

Types of corpus analysis
Lexicological analysis is concerned with the meaning and use of words, central to dictionary making. Corpus-based lexicological investigations address these questions:
What are the meanings associated with a particular word?
*      identify meanings by looking at occurrences in contexts, rather than relying on intuition.
*      One of the advantages of corpus-based research is that the corpus can be used to show all the contexts in which a word occurs. It is then possible to identify the different meanings associated with the word. Programs called concordancers can display the occurrences of a chosen word with its surrounding context. Such displays are called concordance listings: each occurrence of the chosen word is presented on a single line, with the word in the middle and context on each side. This format is referred to as KWIC - Key Word in Context. The list can be sorted alphabetically, in order of occurrence, or by frequency (a minimal concordancer sketch follows below).
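A concordancer can be sketched in a few lines of Python; this is a minimal KWIC display, assuming the corpus has already been tokenized into a list of words (the sample sentence is invented):

    def kwic(tokens, keyword, width=4):
        # print every occurrence of keyword on its own line,
        # with `width` words of context on each side
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                print(f"{left:>30}  [{token}]  {right}")

    kwic("the bank of the river and the bank in the city".split(), "bank")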
What is the frequency of a word relative to other related words?
*      allows us to identify common and uncommon words, which is especially useful in designing teaching materials for language students (a small frequency sketch follows below)
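Relative frequencies are easy to obtain once the corpus is tokenized; a sketch using only Python's standard library (the token list is invented):

    from collections import Counter

    tokens = "to begin is to start and to begin again".split()
    counts = Counter(tokens)                       # raw frequency of each word form
    for word, n in counts.most_common(3):          # most frequent forms first
        print(word, n, round(n / len(tokens), 3))  # raw count and relative frequency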
What collocations does a particular word have?
*      What words commonly co-occur with a particular word? These are its collocations. Left collocates are the co-occurring words immediately preceding the target word; right collocates are the co-occurring words immediately following it (a collocate-counting sketch follows below).
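Left and right collocates can be collected with a small extension of the same idea; a sketch with a window of one word on each side (the example tokens are invented):

    from collections import Counter

    def collocates(tokens, node, side="left"):
        # count the words immediately before (left) or after (right) the node word
        offset = -1 if side == "left" else 1
        pairs = (tokens[i + offset] for i, t in enumerate(tokens)
                 if t == node and 0 <= i + offset < len(tokens))
        return Counter(pairs)

    toks = "strong tea and strong coffee but powerful engine".split()
    print(collocates(toks, "strong", side="right"))   # Counter({'tea': 1, 'coffee': 1})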


What is a word's distribution across registers?
*      Raw counts show the actual number of occurrences of the word in each register, but because registers differ in size, a comparison of raw counts cannot be used to conclude that a word is more common in one register than another. What we need instead is a measure of how often a reader will come across the word: normalized counts convert the number of occurrences to a standard scale, in this case per 100,000 words (see the short sketch below).
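The normalization itself is a single arithmetic step; a sketch with invented figures:

    def per_100k(raw_count, register_size):
        # convert a raw count to occurrences per 100,000 words
        return raw_count / register_size * 100_000

    # invented figures: 500 hits in a 2-million-word register
    print(per_100k(500, 2_000_000))   # 25.0 occurrences per 100,000 words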
How are synonymous words used in different ways?
*      many words that are considered synonymous do not have totally identical meanings and are used differently, as with the seemingly synonymous big, large and great
Lemma means the base form of a word, disregarding grammatical changes such as tense and plurality. It is the dictionary entry under which the various forms of the word are grouped (work - works, working, worked). The node is the word under investigation in a concordance, the basis to which the surrounding collocates are related (a small lemmatization sketch follows below).
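Grouping word forms under a lemma can be sketched with NLTK's WordNet lemmatizer (a rough illustration, assuming the WordNet data is available; the lemmatizer must be told the part of speech to handle verb forms):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # lexical database used by the lemmatizer
    lemmatizer = WordNetLemmatizer()
    for form in ["works", "working", "worked"]:
        print(form, "->", lemmatizer.lemmatize(form, pos="v"))  # every form maps to "work"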

Grammatical analysis - the study of grammar is often seen as a prescriptive discipline (what grammarians declare absolutely correct); however, grammatical studies are rather descriptive (what native speakers consider acceptable). Corpus-based research can be applied to grammar at the sentence level and the discourse level. Studying a morphological characteristic in a corpus can teach us about both its frequency and its distribution.
*      Nominalizations are nouns derived from a verb or an adjective. Academic prose uses nominalizations to treat actions and processes as separate from human participants. Fiction and spoken discourse are more often concerned with people and use verbs and adjectives to describe how they behave (a crude nominalization counter is sketched after this list).
*      Counting grammatical categories is not an easy task, since the question is how to identify them: for example, how to deal with pronouns, which often take the place of nouns, or whether auxiliary verbs should be included in the overall verb count, since they do not provide any lexical content.
*      Syntactic constructions - we can investigate, for example, complement clauses, which complete the meaning of a verb; two types are that-clauses and to-clauses. Across registers, that-clauses are very common in conversation but not so common in academic prose. In contrast, to-clauses are moderately common in both conversation and academic prose.
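The nominalization counter mentioned above can rely on common derivational suffixes; this is a crude heuristic only (it misses irregular cases and overcounts words like "station"):

    NOMINAL_SUFFIXES = ("tion", "ment", "ness", "ity")

    def count_nominalizations(tokens):
        # heuristic: count tokens ending in typical nominalizing suffixes
        return sum(1 for t in tokens if t.lower().endswith(NOMINAL_SUFFIXES))

    print(count_nominalizations("the development of the settlement".split()))  # 2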

Lexico-grammatical analysis is particularly useful when we are attempting to distinguish between words or structures that are nearly synonymous in meaning. Verbs can be grouped according to their valence = their potential for combining with other clause elements: the transitive pattern (with a noun phrase as direct object), the intransitive pattern (no object) and the copula pattern. One or more adverbials can be freely added to all of the patterns.
*      For example, we can analyze the nearly synonymous verbs begin and start. Despite their similarities, these two verbs are typically used in very different structures: corpora show that start is more commonly used as an intransitive verb, whereas begin is much more commonly used as a transitive verb (a rough heuristic for such a comparison is sketched below).
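A crude way to compare the two verbs in a POS-tagged corpus is to check what immediately follows each occurrence; this is a sketch only (real valence analysis needs parsing, and the tag set and example data are invented):

    def transitive_ratio(tagged, forms):
        # fraction of the verb's occurrences directly followed by a noun-phrase opener
        np_starts = {"DT", "NN", "NNS", "PRP", "JJ", "CD"}
        hits = [i for i, (w, t) in enumerate(tagged) if w.lower() in forms]
        trans = sum(1 for i in hits
                    if i + 1 < len(tagged) and tagged[i + 1][1] in np_starts)
        return trans / len(hits) if hits else 0.0

    tagged = [("She", "PRP"), ("began", "VBD"), ("the", "DT"), ("lecture", "NN"),
              ("and", "CC"), ("it", "PRP"), ("started", "VBD"), ("late", "RB")]
    print(transitive_ratio(tagged, {"begin", "began", "begins", "beginning"}))  # 1.0
    print(transitive_ratio(tagged, {"start", "started", "starts", "starting"}))  # 0.0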

Multidimensional analysis includes multiple texts from a wide range of spoken and written registers. The goal is to include a wide range of the linguistic features that have functional associations: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, questions, nominal forms, passives, subordination and coordination features, prepositional phrases, adjectives, adverbs, modals and negation.
*      The analyst is faced with an overwhelming amount of data, so a statistical procedure known as factor analysis is used to show which of the linguistic features tend to co-occur in texts. Each set of co-occurring features is called a dimension of variation. It is then possible to compute dimension scores for each text and to compare texts and registers (a minimal sketch follows after this list).
*      Linguists have also long been interested in the development of student writing. Two popular measures of student writing development have been the number of words per text and the average length of T-units in a text (a T-unit = an independent clause with all its dependent clauses), used to measure syntactic complexity. Researchers used to compare student writing across grade levels in terms of overall essay length and average T-unit length; the ability to write longer essays and longer T-units was taken as a sign of increased proficiency. Previous studies focused on a small number of speakers, but the multi-dimensional approach can analyze variation along several dimensions at once.
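The factor-analysis step mentioned above can be sketched with scikit-learn; the feature matrix here is entirely invented (rows are texts, columns are normalized counts of linguistic features):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # invented data: 4 texts x 3 features
    # (columns could be counts of pronouns, passives, nominalizations per 100,000 words)
    X = np.array([[8.0, 1.0, 2.0],
                  [7.5, 0.5, 1.5],
                  [1.0, 6.0, 7.0],
                  [0.5, 7.0, 6.5]])

    fa = FactorAnalysis(n_components=1, random_state=0)
    scores = fa.fit_transform(X)   # one dimension score per text
    print(scores.ravel())          # the first two texts separate from the last two
    print(fa.components_)          # how strongly each feature loads on the dimension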

*      Corpus-based techniques are also useful for investigating second-language acquisition. Errors produced by second-language students have been discussed from a variety of perspectives, such as the gravity of errors and the nature of errors as interlanguage.
