Words (etc.)
John Barnden
School of Computer Science
University of Birmingham
Natural Language Processing 1
2010/11 Semester 2

Some Problems
• Intro Exercise-Set B suggests that our intuitive notion of a “word” is vague, and inadequate even for informal purposes.
  – E.g., we think of sentences as formed of words, but probably wouldn’t think of “£12,000” as a word, or of “VAT” as a proper word (except when it means a large container, of course), but would nevertheless agree that “Mary bought the car for £12,000 plus VAT” is a properly formed sentence!
• There are many sorts of abbreviation, acronym and special symbol in written language, and the relationship of such special units to spoken language is diverse.
• And what about punctuation marks – are they words? They carry meaning, after all.
• And what about things like “um”, “ah”, “ow”, “owwww” (in speech or text)?
• New words are created every day. A new one for me recently: “treggings” at Marks and Spencer.
  – Example of a “portmanteau” word, formed from melding other words – here “trousers” (??) and “leggings”.

Word Separation
• Words are typically not separated from each other in speech, and can subtly affect each other’s sound. So separation of speech into words is itself a theoretical act.
• In many languages (e.g., Chinese, Japanese, old Latin) words are not even typically separated in writing.

Differences between Words
• In language study, differences of meaning or sound or spelling sometimes are, and sometimes aren’t, taken to indicate different words.
  – “Present” [noun: gift] and “present” [verb, as in: present a proposal] are typically regarded as different words though spelled the same. (Same spelling, different meaning and sound.)
  – “Bank” [noun: financial] and “bank” [noun: of a river] may be taken to be different words, but may instead be regarded as one word with a variable meaning. (Same spelling and sound, different meaning.)
    NB: there are also one or more verbs spelled “bank”, including one with a financial sense.
  – “Patent” [noun: legal doc.] can be pronounced in two different ways, but both are typically taken to be versions of just one word. (Same meaning and spelling, different sound.)
  – “Realize” and “realise”: typically regarded as alternative spellings of the same word. (Same meaning and sound, different spelling.)

Special Terminology
• Language studiers have introduced specialized terms to try to make things more precise, but there is still some looseness and variability in the terminology.
• We’ll look briefly at:
  – Homographs, homophones, homonyms
  – Lemmas
  – Citation forms
  – Wordforms, lexical forms
  – Lexemes
  – Lexical items

HomoXs
• According to J&M (pp.290, 646, 648):
  – “Homographs”: words with the same spelling but different sound, such as “live” [verb] and “live” [adjective]. [I think J&M also mean that the words have different meanings, excluding cases like “patent” above.]
  – “Homophones”: words with the same sound but different spelling, such as “to”, “too” and “two”. [I think J&M also mean that the words have different meanings, excluding cases like “realise/realize” above.] And note that “to” has more than one meaning.
  – “Homonyms”: different word senses (meanings) that belong to words with the same spelling and sound, as in “bank”.
• BUT: Other academics, and dictionaries, may define that terminology somewhat differently.
  – My office dictionary [Webster’s Third New International] defines “homonym” to mean various things, none the same as J&M’s definition!! One (!!) of the meanings is: one of two or more words spelled and pronounced alike but different in meaning. And “homonym” can also mean the same as “homograph” or “homophone”!!

More Terminology
• A wordform (also: lexical form) [loosely following J&M p.120] is a particular written string or spoken sound that would be regarded as a word of the language.
  So all occurrences of the written string “presents” would be occurrences of the same written wordform, irrespective of sound or meaning. All occurrences of the spoken item that sounds like “to” would be of the same spoken wordform, irrespective of spelling or meaning. I’ll use lexical form in preference to wordform to emphasize inclusion of special units such as abbreviations, acronyms and numerals.
• A lexeme [J&M p.645] is a lexical form (spoken or written) together with a particular sense (meaning) for it.
• A lemma or citation form [J&M p.645] is a particular wordform used as a sort of standard or basic version of the wordform in a lexeme. Thus “carpet” and “carpets” are in different lexemes but have the same lemma, “carpet”. The lemma for “sing”, “sang” and “sung” is “sing”.
  – Caution: J&M note on p.646 that “lemma” is sometimes used to mean the sense part of a lexeme. Also, they themselves give a definition significantly different from the above on p.120!!

More Terminology, contd.
• The term lexical item or lexical entry is often used to mean one of the (main) items that are listed in a dictionary or lexicon (lexicon = database of words in, e.g., an AI system) and to which the dictionary or lexicon gives meanings.
• So lexical items are typically citation forms.
• But note that dictionaries often list irregular inflected forms separately, e.g. “sung” (as past participle of “sing”).
• An item in a dictionary can be a phrase rather than a single word.
• “Lexical item” sometimes means the same as my “lexical form”.

Different Sorts of Meaning-Difference
• Homonymy or ambiguity: these are the names given to cases where a particular lexical form has a variety of senses.
• A special case is polysemy, where different senses for a lexical form are related in some way.
  – E.g., “bank” can mean a financial institution or a building serving customers of it.
  – A “newspaper” can mean an institution or a particular physical object produced by that institution.
  – Exercise: “window”
• Some people restrict “homonymy” and/or “ambiguity” to apply only to those cases of different senses that aren’t cases of polysemy.

Arbitrariness of Sense Differences
• The number of senses a lexical form has, and what they are, is in large part a matter of choice and convenience for particular purposes.
• Different dictionaries, NLP systems, etc. divide up senses differently.
• Consider the verb “cut”, as applied to physical objects. Cutting proceeds significantly differently according to the type of object (cake, grass, meat, hair, ...). Do these correspond to different senses of “cut”??
• Consider “cut” as applied to government expenditure. Does this involve a different sense of “cut”, or is there just one very abstract sense that applies to expenditure and hair and grass and ...?
• To what extent do lexical forms have identifiable senses at all? I.e., perhaps the sense in action at a given point is often at least partly affected by the unique situation being talked about?

Special Word Classifications
• Lexical forms (and citation forms) are classified in a variety of ways.
• We’ll look at a few main classifications:
  – of lexical forms into “parts of speech” (such as nouns and verbs; also often called “lexical categories” or “word classes”, though these terms are more ambiguous).
  – of nouns into “proper nouns” (David Cameron) and “common nouns” (car).
  – of citation forms into “open class” or “closed class”.
• Those classifications are of particular importance to practical NLP.
• All of them have their grey (or dazzlingly multicoloured?!) areas, as does everything concerning human language! That applies both within a single language and between languages.

Why Important?
• Parts of speech (POSs)
  – constitute a basic level of grammatical analysis: not much more may be possible in difficult circumstances
  – help with more complex grammatical analysis of utterances
  – are useful by themselves in specialized practical tasks such as “information extraction” and “named entity recognition”
  – are useful in more academic tasks such as searching corpora (large bodies of recorded language) for examples of desired types: e.g. one can search for all examples of the word “spaceship” preceded by an adjective and followed by a verb.
• Proper nouns
  – are important in practical tasks such as named entity recognition and document summarization.
• Closed class words
  – are important in grammatical analysis of utterances
  – may need to be suppressed in other types of task.

Parts of Speech (POSs)
• Lexical forms are traditionally put into categories such as noun, pronoun, determiner, article, verb, adjective, adverb, preposition, particle, conjunction, interjection. E.g.:
  – Determiner: e.g.: the, a, this, ... and possibly: every, which
  – Article: the, a, an
  – Particle: e.g.: up, off, in, at, by when closely tied into a phrasal verb: e.g. “take up”.
  – Preposition: e.g.: up, off, in, at, by in freer uses
  – Conjunction: e.g.: and, or, but, since, if, when, because
  – Interjection: e.g.: hello, wow
• Many lexical forms have more than one POS (consider love, three, off, that, kill)
• Modern linguistic theories may propose more categories, from 12 upwards.
• NLP systems typically use many more categories: from mid-30s (e.g. within the set of 45 “tags” in the Penn Treebank tagset [textbook]) up to 140+
• See various lists of POSs and related “tagsets” in sections 5.1-5.3 of textbook.

How are POSs Defined?
• With great difficulty!
• One try: use (partly-)conceptual criteria as in:
  – Noun: denotes an entity or entity concept (car, snow, love, Tony Blair, Santa Claus, ...)
  – Verb: means an action, state, relationship, etc. (to push, to sleep, to love, to be, ...)
  – Adjective: adjusts an entity concept denoted by a nearby noun (red, sad, fake, ...)
  – Adverb: qualifies some event or state denoted by a nearby verb, adjective, adverb, clause, etc. (boldly, tomorrow, here, loosely, very(?), ...)
  – Determiner: specifies what specific entities (denoted via other words), if any, are being talked about by a nearby noun
  – NB: the qualifications above about what POSs are nearby are my own.
• Such criteria are often mentioned, but are problematic.
  – “destruction” is a noun but refers to an action
  – auxiliary verbs, such as “have” in combination, have a special function
  – “my”: sometimes classed as adjective, sometimes as determiner.

How are POSs Defined?, contd.
• Something else that may help: morphological data, as in:
  – Certain lexical forms clump together like this:
      carrot, carrots, carrot’s, carrots’
      man, men, man’s, men’s
  – Certain other lexical forms pattern like this:
      criticize, criticizes, criticized, criticized, criticizing
      sing, sings, sang, sung, singing
• So we can postulate two different classes and call them nouns and verbs.
  – The distinction being drawn is: one class has a singular/plural dimension and a possessive/non-possessive dimension; the other has a singular/plural dimension, a tense dimension, and ...
• In English, this sort of approach doesn’t extend very well to other words, which largely don’t inflect (change shape) etc.

How are POSs Defined?, contd.
• What may be more helpful is “distributive” data, i.e. data about how words go together in meaningful linguistic expressions, as in:
  – Only certain words can follow “the” or “a[n]”; of these, some must usually be followed by other words to be considered a meaningful unit; others don’t need to be.
  – So we can have: the/a car, the/a big car
  – But not usually: the/a big.
  – And car and big act differently after forms of “be”:
  – We can have: the thing is big, but not: the thing is car.
  – Certain words such as is and hated need words around them to make sense.
• This sort of data (together perhaps with the above conceptual and morphological data) may make it useful to divide words into nouns, adjectives, determiners, etc.
• Having done this, we may find it relatively easy to specify in general a grammar, i.e. a description of the strings of lexical forms that are allowed in the language, based on their assigned classes.

How are POSs Defined?, contd.
• The classification of some words in a given language is contentious, and differs between different schemes (see textbook).
• The way words are classified in one language may not work well for another relatively distant one. We shouldn’t expect the notion of noun, verb, adverb, etc. familiar from one language to correspond in a simple way to categories in another language.
• However, relatively close languages like English and French can be given at least roughly the same POSs, even though there’s a lot of detailed difference in the morphological and distributive data.
• So perhaps one test of a classification scheme is how well it survives across (related) languages.

How are POSs Defined?, contd.
• Some boundaries are unclear:
  – past-participle verb forms can act as adjectives (as in the cactus is fully grown)
  – The boundaries between prepositions, particles and adverbs are unclear (consider around).
• The situation is greatly complicated by the ability to transfer words between categories fairly freely. E.g.:
  – I’m going to powerpoint my lecture. [Could be said even if “powerpoint” has never previously been used as a verb.]
  – The good, the bad and the ugly. The big and the small.
  – He gave me an answer, but it was full of ifs and buts.
  – Shop Tesco. [A command possible in the USA.]
  – In car mechanic, is car an adjective (derived from a noun) or is it a noun in a special noun-noun construct?? [The latter is the usual view in AI]

How are POSs Defined?, contd.
• In some schemes, some types of multi-word phrase are considered to have a POS:
  – conjunctions: as if, as long as, now that
  – pronouns: one another, each other
  – “proper” nouns: David Cameron, The Daily Telegraph.
• Other complications:
  – Ellipsis allows abbreviated grammar: Strawberry, please (when asked for a flavour choice). This complicates the distributive data.
  – Idioms and special constructions sometimes involve strange uses of words, or words nowhere/hardly used elsewhere: by and large, the more I sing the better, by dint of
  – Foreign words and phrases need to be accounted for: mea culpa, quid pro quo
  – Quotations need to be accounted for, and the lexical forms inside them may not follow grammatical rules: She wrote “is ten are banana to five gold” on the blackboard. It is possible to regard any quotation as a special sort of noun, referring to the string inside the quotation marks.

Proper and Common Nouns
• Proper nouns are, roughly speaking, those that refer to specific (though not necessarily real) entities in specific contexts:
  – David Cameron, David, Santa Claus, The Guardian, Love Actually [a film title], Edgbaston, University of Birmingham, Department of the Environment, School of Psychology, Prolog, English(?)
  – NB: many universities may have a School of Psychology: the “specificness” is only within particular contexts.
• Common nouns are the remaining nouns:
  – car, carrot, bandwidth, fifteen [in some uses], relationship, baking [as in the baking of the cake]
• Not clear whether the following should count as proper nouns:
  – Christianity, Islam, Act V, January, Tuesday [when used to refer to a day, not an actress!], Celica [Toyota car model]
• Not clear why nouns like fifteen and love aren’t considered to be proper nouns: they can be considered to refer to specific entities, after all.

Proper and Common Nouns, contd.
• In English, proper nouns are usually spelled with an initial capital letter, if a single word, or at least have the main words within them initially capitalized
  – But sometimes not, for special effect (advertisements) or choice (poet e.e. cummings)
• But (in English) not all words spelled with a capital letter are proper nouns: they may be common nouns or not nouns at all:
  – Christianity, Islam, Tuesday, January ??? [nouns]
  – Blairite [can be noun or adjective]
  – Englishwoman [noun], English [can be adjective]
  – Islamic [not noun]
  – I [pronoun]
  – Birminghamize [not noun]
  – Many acronyms and abbreviations: NB, PS, NLP, CS, ABM, VD, DVD [most are nouns]
• Anyway, the capitalization semi-criterion breaks down in many other languages (incl. ones close to English such as German) and in English sentence starters, document titles and section headings, headlines, etc.
• Many proper nouns are spelled the same as common nouns or other words (apart from capitalization): Peter, Blacksmith.

Closed Classes of Words
• These are classes of lexemes that have largely or completely fixed, and either relatively small or very rule-prescribed, membership. E.g.:
  – Prepositions
  – Particles
  – Determiners
  – Conjunctions
  – Pronouns
  – Auxiliary verbs: e.g. can, should, may, be, have, do
  – Written-out numerals: e.g.: one, seven, first, seventh, thirty-nine, millionth
  – Degree modifiers: e.g.: very, quite, too
• Such words are often/usually “function words”: they have special roles in grammar, as is arguably the case for all the above, in varying degrees, except for numerals.
• Not completely fixed membership: e.g.
  – ordinary language evolution
  – invention of words for very large numbers
  – invention of quasi-numerals such as zillion and nth, ith, jth.

Open Classes of Words
• Classes of lexeme to which new members can be freely added, and often are.
• Notably:
  – Nouns
  – Non-auxiliary Verbs
  – Adjectives
  – Adverbs
  – Interjections.
• Some fairly recent examples:
  – tweet [verb & noun, related to Twitter communications: NB the lexical forms existed before, both as verb and noun],
  – treggings, globish [a newly arising global form of English], mobile [noun for phone], Blairite
  – remote [noun, short for remote control]
  – Newly invented proper nouns (or common nouns conscripted): Johnathan [new to me anyway], the Gherkin [a building in London], Marina Abigail Barnden

What Now?
• We’ll shift to talking about “morphology”, helped by our new knowledge of words and classes of them.
• That will involve looking at how to compute the morphology of words.
• After that we’ll be in a position to look at “POS tagging”: actually computing the POS of words in discourse.
• To some extent at least, POS tagging also includes, or can include, finding proper nouns, and intrinsically includes making the open/closed-class distinction.
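
As a rough illustration of the open/closed-class distinction, here is a minimal Python sketch that treats closed classes as small fixed lookup tables and counts everything else as open class. The table `CLOSED_CLASS` and the functions `closed_classes` and `split_open_closed` are hypothetical names invented for this example, and the word lists are only the sample members mentioned in these notes, not complete inventories. Because many lexical forms have more than one POS, a bare lookup like this ignores context and can easily be wrong (e.g. “can” as a noun); real POS tagging has to do better.

```python
# Hypothetical, partial sample of closed-class words taken from the examples
# in these notes. A real system would need far fuller lists, or a POS tagger.
CLOSED_CLASS = {
    "determiner": {"the", "a", "an", "this", "every", "which"},
    "preposition": {"up", "off", "in", "at", "by"},
    "conjunction": {"and", "or", "but", "since", "if", "when", "because"},
    "auxiliary verb": {"can", "should", "may", "be", "have", "do"},
    "degree modifier": {"very", "quite", "too"},
}


def closed_classes(token):
    """Return the sorted names of the sample closed classes that list this token."""
    word = token.lower()
    return sorted(name for name, members in CLOSED_CLASS.items() if word in members)


def split_open_closed(tokens):
    """Split tokens into (open_class, closed_class) lists by table lookup only."""
    closed = [t for t in tokens if closed_classes(t)]
    open_class = [t for t in tokens if not closed_classes(t)]
    return open_class, closed


if __name__ == "__main__":
    tokens = "The cat may sleep in a very big box".split()
    open_words, closed_words = split_open_closed(tokens)
    print("open class:  ", open_words)
    print("closed class:", closed_words)
```

A newly coined form such as “treggings” falls into the open-class list simply because no table mentions it, which loosely mirrors why closed classes can be enumerated in advance while open classes cannot.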