Data Management and Linguistic Analysis:
MDS applied to RODA

Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler
York University, Toronto, Canada
Order of Presentation
n   Context
    n Romanian and RODA
    n RODA as Linguistic Technology
    n Examples
          • Latin Word-final /u/
          • Non-palatalized dentals before front vowels
n   MDS
    n MDS as an analytic tool
    n MDS and Romanian Dialects
Context
Romania




          Source:
          http://en.wikipedia.org/wiki/Romanian_language#Geographic_distribution
Romanian




n 22+ million speakers
n critical exemplar of the eastern
  Romance language family
Noul Atlas lingvistic român. Crişana

n   Crişana region in
    north-west
    Romania
n   Hard copy atlas by
    Stan and Uritescu
    (1996, 2003)
n   Digitize to make it
    more accessible
Objective
n   Use Information Technology to
    permit a broad range of scholars to
    n access the data,
    n select the data appropriately, and

    n present the data clearly;

and so gain greater understanding of
 its significance.
State of the Project (Nov 2007)

n   Have entered all 407 maps
    from Vol. I and II
    n Twice proof-read
    n Consulted source slips, when needed

n Have developed search and mapping
  tools to access the digital data
n Initial version now posted at:
    http://vpacademic.yorku.ca/romanian
RODA as
linguistic technology
The technology allows one to:
n View the data
n Search for data and count it
n Interpret the data or the counts
n Analyze the data (e.g. MDS)
n See the results as maps
    n Save the maps as .jpg pictures
    n Save the results for later use

n   Hear samples of the data
RODA: function
 n   Custom-defined maps
     • You select the data
     • You see the result as a map
 n   Programmable access to the whole set
     of digitized data
     • You ask about data spread over many maps
     • You can customize what you search for
       (not just the editor’s choice)
RODA: search of data
 n   Context of search becomes important
     • Word-final vs non-final vs either
     • Plain character vs accented character
     • Character vs (superposed) alternate
 n   Choice of fields to search
     • E.g. With nouns: sg. vs pl. entries
     • Variations heard by field workers
     • Flags to mark special situations (e.g.
       hesitation)
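
A minimal sketch of how the search context could affect a count, assuming a flat list of (location, form) records. The record layout, the sample forms, the location numbers, and the function name `count_matches` are all hypothetical; they are not RODA's actual data model or interface. Accented characters are treated as distinct from plain ones, so "â" does not match "a".

```python
import re
from collections import Counter

# Hypothetical records: (location id, transcribed form). Illustrative only.
records = [
    (1, "cântu"),
    (1, "ochiu"),
    (2, "cântu"),
    (3, "cânt"),
    (3, "uşă"),
]

def count_matches(records, char, position="either"):
    """Count, per location, the forms that contain `char` in the given position."""
    c = re.escape(char)
    patterns = {
        "final": c + r"$",          # word-final only
        "non-final": c + r"(?=.)",  # followed by at least one more character
        "either": c,                # anywhere in the form
    }
    pattern = re.compile(patterns[position])
    counts = Counter()
    for location, form in records:
        if pattern.search(form):
            counts[location] += 1
    return counts

final_u = count_matches(records, "u", "final")
non_final_u = count_matches(records, "u", "non-final")
```

With these sample records, the word-final search finds forms at locations 1 and 2 only, while the non-final search finds a form at location 3 only, which is why the choice of context matters for the resulting map.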
Examples from RODA
Crişana,
Romania

(from RODA)
Seeing Words Change
     Word-final /u/
     in Latin and
     non-Latin words
Word-final /u/ from Latin
    Latin            Romanian (standard     Dialectal variation
                     and most dialects)

    canto ‘I sing’   cânt                   cântu (vowel present)
                                            cântu (non-syllabic)

    oculum ‘eye’     ochi                   ochiu
                                            ochiu
Is word-final /u/ random?
n Look for a geographic pattern over
  all potential occurrences
n The maps for single examples, such
  as /ochi/, are in the hard-copy
  dialect atlas.
n But the total data for all examples is
  spread widely over many maps.
Word-final /u/

Data from:
•407 maps
•Field 1

Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
Syllabic and non-syllabic /u/

Data from:
•Selected maps
•Field 1
•Word-final or non-word-final

Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
Word-final, syllabic /u/

Data from:
•407 maps
•Field 1
•word-final only
•(horizontal = vertical)

Locations 137, 141, 146 show most examples
Word-final, syllabic /u/

Can review the data
Word-final, syllabic /u/

Data from:
•selected maps
•Field 1
•word-final only
•removed non-vocalic /u/, def. art., some clusters + /u/
•(horizontal = vertical)

Locations 137, 141, 146 show most examples
/u/ Pattern
n   There is a pattern:
    n Word-final /u/ is retained in central and
      north-eastern areas
    n It is syllabic mostly in parts of the
      central area
    n The locations with most frequent syllabic
      final /u/ do not form a continuous area
Dialect sub-regions
n Some locations have a given
  feature; others do not.
n On the basis of such (sometimes
  limited) examples, linguists posit the
  existence of dialect sub-regions.
n MDS analysis of “all” data raises
  questions about the nature of these
  sub-regions.
Non-palatalized dentals
before front vowels
Non-palatalized dentals before front vowels

n Crişana: dentals before front vowels
  are palatalized.
n Are they restructured as palatals?
    n If the process is no longer productive,
      there may be non-palatalized dentals
      before front vowels.
    n If so, where, in what forms, and with
      what frequency?
Non-palatalized dentals before front vowels
•Examples everywhere.
•(As is well-known, dentals are not palatalized in Oaş,
 except for 220.)
•Map shows where and how many examples.
Non-palatalized dentals before front vowels

n There are examples everywhere
  (not only in Oaş)
n Here we establish a result with the
  location and frequency of examples.
n Can view the examples that support
  the conclusion.
MDS
MDS as an analytic tool
n In addition to select, search, count
  and map functions, RODA can have
  special-purpose analytic tools.
n A built-in MDS tool allows us to
  create MDS maps based on any
  selected set of data.
n Other analytic techniques could also
  be implemented.
MDS Process-1

Multidimensional scaling (MDS) uses the “linguistic distance”
between n+1 locations to place them in an n-dimensional
space exactly...
MDS Process-2

MDS projects an n-space onto a 2-space (a map) so that the
distances among the points are preserved as well as possible.
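
A minimal sketch of the projection step, assuming classical (Torgerson) MDS computed with NumPy. The slides do not say which MDS variant RODA's built-in tool uses, so the method, the function name `classical_mds`, and its parameters are assumptions made for illustration.

```python
import numpy as np

def classical_mds(distances, dims=2):
    """Place points in `dims` dimensions from a symmetric distance matrix.

    Assumption: classical (Torgerson) MDS, not necessarily RODA's method.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest `dims`.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:dims]
    # Clip small negative eigenvalues that arise from non-Euclidean distances.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale

# Three points whose distances form a 3-4-5 right triangle.
triangle = [[0, 3, 4],
            [3, 0, 5],
            [4, 5, 0]]
coords = classical_mds(triangle)
```

Because three points always fit in a plane, the 2-space coordinates here reproduce the input distances exactly (up to rotation and reflection). With many locations, the 2-space picture is only an approximation of the full set of distances.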
Projection to 2-space
MDS Process-3
n The linguistic map may or may not
  correspond to geography
n It does give a high-level picture of
  the total linguistic relationship: All
  the data used to get the distances is
  now displayed as a single picture.
Distance measures
n   Based on linguistic forms being
    “same” or “not same”
    n   Does not account for forms that are
        nearly the same:
        • “cat” ~ “caţ” ~ “feline”
    n   Missing forms are “not same”
n   Summed over many comparisons
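
A minimal sketch of this "same / not same" measure, assuming each location is a dictionary from map item to recorded form. The data layout, the sample forms, and the names `linguistic_distance` and `distance_matrix` are hypothetical. An item with no form at either location is counted as a difference, which is one reading of "missing forms are not same".

```python
def linguistic_distance(forms_a, forms_b):
    """Count the items on which two locations do not have the same form."""
    items = set(forms_a) | set(forms_b)
    distance = 0
    for item in items:
        a = forms_a.get(item)
        b = forms_b.get(item)
        # A missing form is "not same"; any other difference also counts as 1.
        if a is None or b is None or a != b:
            distance += 1
    return distance

def distance_matrix(locations):
    """Return sorted location names and their pairwise distance matrix."""
    names = sorted(locations)
    matrix = [[linguistic_distance(locations[a], locations[b]) for b in names]
              for a in names]
    return names, matrix

# Hypothetical data: the item keys and the form "dog" are illustrative only.
locations = {
    "A": {"item1": "cat", "item2": "dog"},
    "B": {"item1": "caţ", "item2": "dog"},
    "C": {"item1": "feline"},
}
names, matrix = distance_matrix(locations)
```

Under this measure "cat" is as far from "caţ" as it is from "feline": each mismatch contributes exactly 1, which is the limitation noted above.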
MDS and dialects
n   Embleton and Wheeler have used an MDS process on
    n   English dialects
    n   Finnish dialects
n   Dialect roughly correlates with geography
Romanian Dialect groupings
n Begin with a hypothesis about
  dialect groupings in Crişana.
n Analyzed all data in 403 maps, using
  the MDS method.
    n Identity is exact match; any difference
      is a difference of 1.
    n Distance is sum of differences.

n   We see the groupings on a map.
MDS map
All groups

n   South-east and South-west are distinct.
n   The rest are less so.
    n   Suggests the dialect unity of the region
n   --> refine groupings
MDS map
Refined groupings

n   Still, considerable overlap or closeness
n   More groups that could be identified, e.g.:
    n   Several divisions in West
n   Two areas in Oaş
    n   Oaş is close to southern areas
    n   Still, its distinctness is clear (cf. also Uritescu 1984a).
MDS map
Refined
groupings
Crişana dialect regions
When a lot of data is considered:
n There is much overlap of regions
n A few regions are distinct.
It is possible that areas share features in a
  complex way, based on distance, physical
  geography and other factors.
There is more apparent unity than
  traditional analyses (based on a few
  features) would provide.
Further investigation
We want to look at:
n Differences in vocabulary (rare vs
  common terms)
n Phonetics vs morphology vs syntax
n Other definitions of distance
RODA and MDS
n RODA provides the large amount of
  data.
n MDS makes the large amount of
  data readily understandable as a
  single picture.
n Implementing MDS in RODA means
  that researchers can easily try the
  approach.
Summary
n   RODA provides:
    n   Accessible data
    n   Flexible searching and custom presentation
    n   Repeatable processing
n   MDS makes the data easy to visualize
n   Result: new linguistic insights based on
    the greater understanding of the data
Contacts
n Sheila Embleton
embleton@yorku.ca
n Dorin Uritescu
dorinu@yorku.ca
n Eric Wheeler
wheeler@ericwheeler.ca

Site: vpacademic.yorku.ca/romanian/

				