					Data Management and Linguistic Analysis:
         MDS applied to RODA

    Sheila M. Embleton, Dorin Uritescu
            & Eric S. Wheeler
       York University, Toronto, Canada
Order of Presentation
   Context
     Romanian and RODA
     RODA as Linguistic Technology
     Examples
          • Latin Word-final /u/
          • Non-palatalized dentals before front vowels
   MDS
     MDS as an analytic tool
     MDS and Romanian Dialects
Context
Romania




          Source: http://en.wikipedia.org/wiki/Romanian_language#Geographic_distribution
Romanian




 22+ million speakers
 critical exemplar of eastern
  Romance language family
Noul Atlas lingvistic român. Crişana

   Crişana region in
    north-west
    Romania
   Hard copy atlas by
    Stan and Uritescu
    (1996, 2003)
   Digitize to make it
    more accessible
Objective
   Use Information Technology to
    permit a broad range of scholars to
     access the data,
     select the data appropriately, and

     present the data clearly;

and so gain greater understanding of
 its significance.
State of the Project (Nov 2007)

   Have entered all 407 maps
    from Vol. I and II
     Twice proof-read
     Consulted source slips, when needed

 Have developed search and mapping
  tools to access the digital data
 Initial version now posted at:
    http://vpacademic.yorku.ca/romanian
RODA as
linguistic technology
The technology allows one to:
 View the data
 Search for data and count it
 Interpret the data or the counts
 Analyze the data (e.g. MDS)
 See the results as maps
     Save the maps as .jpg pictures
     Save the results for later use

   Hear samples of the data
RODA: function
    Custom-defined maps
     • You select the data
     • You see the result as a map
    Programmable access to the whole set
     of digitized data
     • You ask about data spread over many
       maps
     • You can customize what you search for
       (not just the editor's choice)
RODA: search of data
    Context of search becomes important
     • Word-final vs non-final vs either
     • Plain character vs accented character
     • Character vs (superposed) alternate
    Choice of fields to search
     • E.g. With nouns: sg. vs pl. entries
     • Variations heard by field workers
     • Flags to mark special situations (e.g.
       hesitation)
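The search contexts listed above can be pictured with a small sketch. This is not RODA's actual search code: the function names `strip_accents` and `matches`, and the `position` and `plain` parameters, are assumptions made for illustration, and only the word-final/non-final and plain/accented distinctions are covered.

```python
import unicodedata


def strip_accents(text):
    """Remove combining diacritics, so that 'â' becomes 'a' and 'ş' becomes 's'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(form, char, position="either", plain=False):
    """Report whether the single character `char` occurs in `form`.

    position: "final", "non-final" or "either" (hypothetical option names).
    plain:    if True, accents are ignored on both the form and the character.
    """
    if plain:
        form, char = strip_accents(form), strip_accents(char)
    else:
        # Compare precomposed characters so that 'â' is one unit, not 'a' + accent.
        form = unicodedata.normalize("NFC", form)
        char = unicodedata.normalize("NFC", char)
    if position == "final":
        return form.endswith(char)
    if position == "non-final":
        return char in form[:-1]
    if position == "either":
        return char in form
    raise ValueError(f"unknown position: {position!r}")


print(matches("cântu", "u", position="final"))   # is there a word-final u?
print(matches("cânt", "a"))                      # accented â versus plain a
print(matches("cânt", "a", plain=True))          # same search, accents ignored
```

Keeping the context as an explicit parameter means the same query can be rerun with a different context, without rewriting the search itself.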
Examples from RODA
Crişana,
Romania

(from RODA)
Seeing Words Change
     Word-final /u/
     in Latin and
     non-Latin words
Word-final /u/ from Latin
    Latin             Romanian (standard      Dialectal variation
                      and most dialects)

    canto 'I sing'    cânt                    cântu (vowel present)
                                              cântu (non-syllabic)

    oculum 'eye'      ochi                    ochiu
                                              ochiu
Is word-final /u/ random?
 Look for a geographic pattern over
  all potential occurrences
 The maps for single examples, such
  as /ochi/, are in the hard-copy
  dialect atlas,
 but the total data for all examples is
  spread widely over many maps.
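Pooling the occurrences over many maps amounts to a count per location. A minimal sketch, assuming a hypothetical layout in which each map is a dictionary from location to recorded form. The location labels and the two toy maps are invented (the forms are the examples shown earlier), and this is not RODA's own data format.

```python
from collections import Counter

# Hypothetical toy data: map name -> {location label -> recorded form}.
# Locations "loc_A" to "loc_C" are placeholders, not RODA survey points.
maps = {
    "canto": {"loc_A": "cântu", "loc_B": "cântu", "loc_C": "cânt"},
    "oculum": {"loc_A": "ochiu", "loc_B": "ochi", "loc_C": "ochi"},
}


def count_word_final(maps, char):
    """Count, for each location, the forms over all maps that end in `char`."""
    counts = Counter()
    for forms_by_location in maps.values():
        for location, form in forms_by_location.items():
            if form.endswith(char):
                counts[location] += 1
    return counts


counts = count_word_final(maps, "u")
for location in ("loc_A", "loc_B", "loc_C"):
    # A Counter returns 0 for locations with no matching form.
    print(location, counts[location])
```

The per-location totals are what a symbol-size map would then display.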
Word-final /u/

Data from:
•407 maps
•Field 1

Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
Syllabic and non-syllabic /u/

Data from:
•Selected maps
•Field 1
•Word-final or non-word-final

Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
Word-final, syllabic /u/

Data from:
•407 maps
•Field 1
•word-final only
•(horizontal = vertical)

Locations 137, 141, 146 show most examples
Word-final,
syllabic /u/


Can review the
data
Word-final, syllabic /u/

Data from:
•selected maps
•Field 1
•word-final only
•removed non-vocalic /u/, def. art., some clusters + /u/
•(horizontal = vertical)

Locations 137, 141, 146 show most examples
/u/ Pattern
   There is a pattern:
     Word-final /u/ is retained in central and
      north-eastern areas
     It is syllabic mostly in parts of the
      central area
     The locations with most frequent syllabic
      final /u/ do not form a continuous area
Dialect sub-regions
 Some locations have a given
  feature; others do not.
 On the basis of such (sometimes
  limited) examples, linguists posit the
  existence of dialect sub-regions.
 MDS analysis of “all” data raises
  questions about the nature of these
  sub-regions.
Non-palatalized dentals
before front vowels
Non-palatalized dentals before front vowels

 Crişana: dentals before front vowels
  are palatalized.
 Are they restructured as palatals?
     If the process is no longer productive,
      there may be non-palatalized dentals
      before front vowels.
     If so, where, in what forms, and with
      what frequency?
Non-palatalized dentals before front vowels
•Examples everywhere.
•(As is well known, dentals are not palatalized in Oaş, except at location 220.)
•Map shows where and how many examples.
Non-palatalized dentals before front vowels

 There are examples everywhere
  (not only in Oaş)
 Here we establish a result with the
  location and frequency of examples.
 Can view the examples that support
  the conclusion.
MDS
MDS as an analytic tool
 In addition to select, search, count
  and map functions, RODA can have
  special-purpose analytic tools.
 A built-in MDS tool allows us to
  create MDS maps based on any
  selected set of data.
 Other analytic techniques could also
  be implemented.
 MDS Process-1

Multidimensional scaling (MDS) uses the “linguistic distance” between n+1 locations to place them exactly in an n-dimensional space...
 MDS Process-2

MDS projects an n-space onto a 2-space (a map) so that the distances among the points are preserved as well as possible.
Projection to 2-space
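The slides do not say which MDS algorithm RODA uses. As one possible illustration, the sketch below uses classical (Torgerson) MDS with NumPy: it double-centres the squared distances, takes the two largest eigenvalues, and uses the corresponding eigenvectors as map coordinates. The function name `classical_mds` and the four-point example are assumptions, not part of RODA.

```python
import numpy as np


def classical_mds(distances, dims=2):
    """Place points in `dims` dimensions from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest `dims`.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:dims]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Toy input: distances among the corners of a 3-by-4 rectangle.
rectangle = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]
coords = classical_mds(rectangle)
# Distances measured on the resulting 2-D map, for comparison with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 6))
```

In this example the four points are the corners of a rectangle, so the distances can be reproduced exactly in two dimensions. With real dialect data the 2-space picture is only an approximation, and its orientation is arbitrary.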
MDS Process-3
 The linguistic map may or may not
  correspond to geography
 It does give a high-level picture of
  the total linguistic relationship: All
  the data used to get the distances is
  now displayed as a single picture.
Distance measures
   Based on linguistic forms being
    “same” or “not same”
       Does not account for forms that are
        nearly the same:
        • “cat” ~ “caţ” ~ “feline”
       Missing forms are “not same”
   Summed over many comparisons
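A same/not-same measure of this kind can be sketched as follows. The data layout (one list of forms per location, with None for a missing form) and the function name `linguistic_distance` are assumptions for illustration. A comparison involving a missing form is counted as "not same", which is one reading of the rule above.

```python
def linguistic_distance(forms_a, forms_b):
    """Sum of 0/1 differences between two locations, one comparison per map."""
    if len(forms_a) != len(forms_b):
        raise ValueError("both locations need one entry per map")
    distance = 0
    for a, b in zip(forms_a, forms_b):
        # A missing form (None) or any mismatch counts as a difference of 1.
        if a is None or b is None or a != b:
            distance += 1
    return distance


# Toy data: three locations, three maps each (None = no form recorded).
location_1 = ["cat", "cânt", None]
location_2 = ["caţ", "cânt", "ochi"]
location_3 = ["feline", "cântu", "ochi"]

print(linguistic_distance(location_1, location_2))
print(linguistic_distance(location_1, location_3))
print(linguistic_distance(location_2, location_3))
```

Note that “cat” ~ “caţ” contributes exactly as much as “cat” ~ “feline”: each is a difference of 1.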
    MDS and dialects
   Embleton and
    Wheeler have
    used an MDS
    process on
       English dialects
       Finnish dialects
   Dialect roughly
    correlates with
    geography
Romanian Dialect groupings
 Begin with a hypothesis about
  dialect groupings in Crişana.
 Analyzed all data in 403 maps, using
  the MDS method.
     Identity is exact match; any difference
      is a difference of 1.
     Distance is sum of differences.

   We see the groupings on a map.
MDS map
All groups

   South-east
    and South-
    west are
    distinct.
   The rest are
    less so.
       Suggests
        the dialect
        unity of the
        region
   --> refine
    groupings
MDS map
Refined groupings
 Still, considerable overlap or closeness
 More groups that could be identified, e.g.:
     Several divisions in West
     Two areas in Oaş
 Oaş is close to southern areas
     Still, its distinctness is clear (cf. also
      Uritescu 1984a).
Crişana dialect regions
When a lot of data is considered:
 There is much overlap of regions
 A few regions are distinct.
It is possible that areas share features in a
  complex way, based on distance, physical
  geography and other factors.
There is more apparent unity than
  traditional analyses (based on a few
  features) would suggest.
Further investigation
We want to look at:
 Differences in vocabulary (rare vs
  common terms)
 Phonetics vs morphology vs syntax
 Other definitions of distance
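One candidate for another definition of distance, offered only as an illustration and not as the authors' plan, is a graded measure in which nearly identical forms count as closer than unrelated ones. The sketch below uses Levenshtein edit distance divided by the length of the longer form. The function names are hypothetical, and a missing form is still treated as maximally different.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def graded_difference(a, b):
    """Difference between two forms on a 0.0 (same) to 1.0 (unrelated) scale."""
    if a is None or b is None:
        return 1.0  # assumption: a missing form is maximally different
    # Normalise so that an accented letter counts as a single character.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


print(graded_difference("cat", "caţ"))     # one substitution out of three letters
print(graded_difference("cat", "feline"))  # no letters in common
```

Summing these graded values over all maps, instead of the 0/1 differences, would give a distance matrix that can be passed to the same MDS step.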
RODA and MDS
 RODA provides the large amount of
  data.
 MDS makes the large amount of
  data readily understandable as a
  single picture.
 Implementing MDS in RODA means
  that researchers can easily try the
  approach.
Summary
   RODA provides:
       Accessible data
       Flexible searching and custom presentation
       Repeatable processing
   MDS makes the data easy to visualize
   Result: new linguistic insights based on
    the greater understanding of the data
Contacts
 Sheila Embleton
embleton@yorku.ca
 Dorin Uritescu
dorinu@yorku.ca
 Eric Wheeler
wheeler@ericwheeler.ca

Site: vpacademic.yorku.ca/romanian/

				