Compiling a Spoken Chinese Corpus of Situated Discourse

Gu Yueguo
The Institute of Linguistics
The Chinese Academy of Social Sciences

Corpora Overview

Spoken Chinese Corpora
• A corpus of situated discourse
• A corpus of major dialects
• A corpus of speech

Written Chinese Corpora
• A corpus of contemporary written Chinese
• A corpus of Pre-Qing written Chinese

Main headings
1. Real world discourse: what is it?
2. Recording
3. Encoding
   1. Transcription (a)
   2. Transcription (b)
   3. Mark-up
   4. Tagging
4. Application

Components of the compiling process
(0) 'Real world' spoken discourse
(1) Recording
(2a) Character transcription
(2b) Transcription for a special purpose
(3) Mark-up
(4) Coding
(5) Application

0
Discourse in the Real World

Spoken Chinese

Preparation                                 Single speaker                        Two or more speakers
No preparation                              e.g. talk to oneself                  * e.g. everyday talks
Topics pre-set with no preparation          e.g. narrate a personal story         * e.g. sports salon
Topics pre-set with no written preparation  e.g. oral exam                        * e.g. press interview
Talking based on a written script           e.g. soliloquy, 1-person cross talk   e.g. acting, cross talk
Reading a written script                    e.g. news reading, reading practice   e.g. collective reciting

Real world situated discourse
(1) It is situated to an actual social situation;
(2) It is situated to actual users;
(3) It is situated to an inter-subjective world of discourse;
(4) It is situated to actual goals;
(5) It is situated to spatial and temporal setting;
(6) It is situated to the cognitive capacity of actual users;
(7) It is situated to performance contingencies of actual users who are engaged in spontaneous talking with little pre-planning.

[Figure: diagram centred on "Academic", showing the people and places encountered over a week. Labels recoverable from the diagram: Building X (Mon, Tues, Wedn, Thurs, Fri; F-staff, Academic ZWF, clerks, colleagues, staff meeting, phone calls, visitors, students); Buildings Y and Z (Colleague 1, Colleague 2, other colleagues, visitors, project teams 1 to 3, phone calls); Residential Building (Mon-Fri; wife, son, neighbours, kindergarten, swimming pool, markets); Sat (senior managers, conference organizers, sports playmates); Sun (Summer Resort; research center staff, hotel staff).]

Talking and Doing Interwoven in the Real World
1. Talking is the task, e.g., meeting, seminar (it is task-oriented, task-goal-directed, segmented on the basis of the goal-attaining process. Note that turn-taking rules are based on such a type of talking-task relation)
2. Talking is the main constitutive part of the task, e.g. some classroom discourse, doctor-patient discourse (it is task-oriented, task-goal-directed, segmented on the basis of the goal-attaining process)
3. Talking is a constitutive part of the task, e.g. giving instructions from time to time (task performance is dominant, talking tends to be fragmented)
4. Talking and task run in conflicting parallel, the achievement of the latter serves as a means to the goal of the former, e.g. business dinner (business table talk) (Note that segmenting this kind of talk can be based on the task)
5. Talking is an embedded social part of the task, e.g. talking over the meal (talking has no specific goal to reach)
6. Talking is a decorative part of the task, e.g., talking accompanying tea-making
7. Talking is a hindrance to the task, e.g. talking over a written exam
8. Talking and task are independent of each other

Micro performance analysis of five-minute activities

Time segments: 00:00-1:15, 1:27-2:6, 2:11-3:06, 3:19-4:25, 4:34-4:40

Doing (in sequence):
1. X helps himself with noodles; Y sorts out the things on the table
2. X sorts out the bowl and the chopsticks; Y switches on the computer
3. X sorts out the things on the table; Y continues to sort out the things
4. X starts to reinstall his computer; Y starts to do the layout on computer
5. X continues reinstalling; Y continues doing the layout

Spatial-temporal relations between acts: parallel and independent

Talking:
• X and Y gossip
• X and Y talk about the journal editing
• Y talks to X about a politician
• X talks to Y about the journal layout
• X continues talking to Y about the journal editing

Relation between doing and talking: conflictive; parallel and independent; parallel and relevant

Sampling: Whose job?
Sinclair (1991:13) writes: The specification of a corpus --- the types and proportions of material in it --- is hardly a job for linguists at all, but more appropriate to the sociology of culture. The stance of the linguist should be a readiness to describe and analyse any instances of language placed before him or her. In the infancy of the discipline of corpus linguistics, the linguists have to do the text selection as well; when the impact of the work is felt more widely, it may be possible to hand over this responsibility to language-oriented social scientists.

The standard variety approach
It is arguable that Putonghua should be chosen as the target language, ruling other dialects out of the picture. There are at least two major reasons for doing so. First, Putonghua serves as the standard language used by the media and education. Second, other spoken corpora have also adopted the standard variety.

Criticisms of the standard variety approach
The approach is subject to serious criticisms relating to the preservation of the naturalness of language use. The standard variety is given its identity before the corpus is compiled. The corpus cannot be used to represent its naturalness, nor be used to establish or demonstrate its identity. … what the compilers believe Putonghua looks like. Subjective judgment is also involved in sampling Putonghua speakers by filtering non-standard speakers out. … Unless they are 'commissioned' to talk among themselves, the activities the standard and non-standard interactants are engaged in have to be properly filtered as well.

The sampling: The workplace approach
It is true that situated discourses are unlimited in number. However, the types of social situations to which they are situated can in theory be exhaustively listed. According to the Beijing Yellow Book 1999, there are 67,783 social work units, which we divide into 6 major categories and 31 subcategories.

6 major categories of social work units

Category                                         Units   Share
01 Government, Parties and Other Social Bodies    4823   7.12%
02 Economic organizations                        53838  79.43%
03 Education, research and arts                   6840  10.09%
04 Health, sports, and social welfare             1365   2.01%
05 Public welfare                                  890   1.46%
06 Military                                         27   0.04%
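The shares above follow from the unit counts by simple division. The sketch below recomputes them; the dictionary keys and function name are illustrative, not part of the original project. Under this calculation the six counts sum to 67,783, and the share for category 05 comes out at roughly 1.31% rather than the 1.46% shown, so one of those two figures may have been mistyped on the slide.

```python
# Unit counts per major category, copied from the slide.
# Keys are the two-digit category codes; the names are given in comments.
UNIT_COUNTS = {
    "01": 4823,   # Government, Parties and Other Social Bodies
    "02": 53838,  # Economic organizations
    "03": 6840,   # Education, research and arts
    "04": 1365,   # Health, sports, and social welfare
    "05": 890,    # Public welfare
    "06": 27,     # Military
}


def category_shares(counts):
    """Return each category's share of all units, as a percentage."""
    total = sum(counts.values())
    return {code: 100.0 * n / total for code, n in counts.items()}


if __name__ == "__main__":
    shares = category_shares(UNIT_COUNTS)
    print("total units:", sum(UNIT_COUNTS.values()))
    for code, share in shares.items():
        print(f"{code}: {share:.2f}%")
```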

No  Descriptive title                             mp3 files  Total size (bytes)
 1  accident mediation 1                                  5          23,369,326
 2  accident mediation 2                                  8          30,944,114
 3  Administrative meetings                             107         561,000,000
 4  assessment meeting                                    6          68,500,000
 5  auction                                              30         158,000,000
 6  bfsu meeting                                         14          66,200,000
 7  Birthday celebration                                 10          43,100,000
 8  btvu seminar                                         26         138,000,000
 9  bus talk                                             60         294,392,298
10  business negotiation 1                               27         143,285,178
11  business negotiation 2                               26         140,260,744
12  business negotiation 3                               54         284,761,458
13  business negotiation 4                                9          44,767,134
14  child discourse                                     163       1,115,063,560
15  Chinese and Korean first contact                      7          34,708,716
16  Chinese New Year celebration                         11         126,323,484
17  Classmates get-together                              14          73,063,728
18  Classroom discourse-teach Chinese to Koreans        125         574,000,000
19  commercial house key-handling procedure              16          84,512,806
20  community talks                                     322       1,734,865,326
21  end year celebration                                 17          78,310,716
22  fortune telling                                      33         390,741,362
23  Gu Yueguo a week record                             248       1,235,679,186
24  house allocation meeting                             44         239,388,838
25  house decoration team talks                          36         181,660,952
26  Jiangsu TVU review meeting                           11          49,675,918
27  kindergarten meeting                                 28         146,741,690
28  Lan Baochun family talks                             22         285,975,640
29  lawsuit                                              93         508,628,422
30  lovers conversation                                  11          59,845,160
31  medical discourse                                   156         764,274,198
32  ministry education meeting                           99         522,992,404
33  office talk ministry of communication               114         577,889,242
34  peasant family                                       73         373,917,094
35  Peking Univ ceremony                                  7          46,894,312
36  play mah-jong                                        28         145,754,884
37  private conversation                                 77         401,858,424
38  Radio Communication interviews                       24         919,456,512
39  sell and buy                                        296       1,150,000,000
40  seventy-eighty yrs old peasant talks                 22         125,624,138
41  street market shopping                               37         190,887,972
42  student dormitory talks                              66         345,920,582
43  table talks                                          89         529,995,698
44  visit blood donors                                   14          71,655,104
45  Zhu Rongji press conference                          20          97,984,672
    Total                                              2705      15,180,870,992

At 1 second = 15.6503 KB, the total of 15,180,870,992 bytes corresponds to 970,005.11 seconds, or 269.44 hours.
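The conversion from total size to recording time can be sketched as follows. The rate of 15.6503 KB per second is taken from the slide; reading "KB" as 1000 bytes is an assumption, chosen because it reproduces the stated 970,005.11 seconds. The function names are illustrative.

```python
# Slide figure: 1 second of MP3 audio = 15.6503 KB.
# Assumption: KB means 1000 bytes (this reproduces the slide's total).
BYTES_PER_SECOND = 15.6503 * 1000


def mp3_duration_seconds(size_bytes):
    """Estimate recording length in seconds from an MP3 file size in bytes."""
    return size_bytes / BYTES_PER_SECOND


def seconds_to_hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    total_bytes = 15_180_870_992  # corpus total from the table
    seconds = mp3_duration_seconds(total_bytes)
    print(f"{seconds:.2f} seconds, about {seconds_to_hours(seconds):.1f} hours")
```

The same function can be applied to any single row of the table to estimate the length of one category of recordings.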

1 Recording

Recording
1. Who does the recording?
2. What role does the person assume while recording?
3. What is the quality of the recording?
4. In what manner is the recording to be made?
5. How are the ethics of recording to be properly taken care of?
6. What details are to be noted while recording?
7. How are the recordings to be kept safe?

What role does the person assume while recording?
• The recording person as a legitimate observer: s/he is allowed by the authority to take a non-active part in the activity and record the talk. S/he is an outsider. The party is aware of her or his presence and of her or his purpose in being there.
• The recording person as a genuine participant: s/he is an insider.
• The recording person as a surreptitious observer: s/he is a member of the public, and her or his presence draws no particular attention from anyone else.

In what manner is the recording to be made?
• With the approval of all the participants
• With the approval of the key participant
• With the approval of the unit authority
• Open recording which can be noticed by anyone
• Surreptitiously

Recording Record Card

Name of recorder: ________________ Sex: ______________ Occupation: _______________________
Recording start date: _____ (year) ____ (month) ____ (day)
Recording end date: _____ (year) ____ (month) ____ (day)
Recording start time: morning _____ o'clock; afternoon _____ o'clock; evening _____ o'clock
Recording end time: morning _____ o'clock; afternoon _____ o'clock; evening _____ o'clock
Place of talk: _____ province _____ city ____ county ____ township _____ village
Work unit: ______________________________________________
Setting of talk: e.g. office, friend's home, restaurant, meeting room, supermarket, on a train, workshop, at home, shopping mall, hospital, courtroom, hotel, on the street, at an evening party, ___________
Where were you while recording this side of the tape?
1. ________________ 2. __________________ 3. ___________________

Manner of recording: open / covert / covert at first, then open / some participants knew and agreed / all participants knew and agreed

Please fill in the table below with details of the speakers on this side of the tape (the more detailed the better):
Name | Age | Sex | Occupation, professional title, position | Education level | Accent | Relationship to you and to the other speakers

Purpose and occasion of the talk: _______________________________________________________________
Reminder: when this side is finished, check whether the tape needs to be turned over!

(To be filled in by corpus staff)
Original sound file name: _____________________
Chinese character transcription file name: ____________________________
CD number of original sound file: ______________
Segmented sound file name: __________________________
Classification folder name: ______________________
Other: ______________________________________

How are the recordings to be kept safe?
The recordings on the 74-minute minidiscs are all converted into wav files by using the recording function of the sound card. The format is 16 bits, stereo, 44100 Hz. The wav files are then stored on 640 MB recordable compact discs. They are further backed up by being converted into MP3 format (to economize on storage space) and saved again on separate 640 MB recordable compact discs. Furthermore, all the MP3 files are stored on a removable 20 GB USB hard disk.
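The storage saving from the MP3 conversion can be estimated from the stated wav format. The sketch below computes the uncompressed data rate for 16-bit stereo audio at 44100 Hz and compares it with the MP3 rate of 15.6503 KB per second quoted with the corpus totals. It assumes KB means 1000 bytes and ignores the small wav file header; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100   # from the text
BITS_PER_SAMPLE = 16      # from the text
CHANNELS = 2              # stereo, from the text

# MP3 rate quoted with the corpus totals; KB read as 1000 bytes (assumption).
MP3_BYTES_PER_SECOND = 15.6503 * 1000


def pcm_bytes_per_second():
    """Uncompressed audio data rate in bytes per second (header ignored)."""
    return SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS


def pcm_size_bytes(minutes):
    """Approximate size in bytes of a wav recording of the given length."""
    return pcm_bytes_per_second() * minutes * 60


if __name__ == "__main__":
    print("wav bytes per second:", pcm_bytes_per_second())
    print("one full 74-minute disc as wav:", pcm_size_bytes(74), "bytes")
    ratio = pcm_bytes_per_second() / MP3_BYTES_PER_SECOND
    print(f"wav is roughly {ratio:.1f} times the size of the MP3 copy")
```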

2

Transcription

The encoding process
1. Transcription in Chinese characters
2. Transcription in Pinyin/IPA symbols
3. Transcription by using Praat
4. Mark-up by XML
5. Tagging

Issues in segmentation
Segmenting sound streams into orthographic and phonetic linear units is the first major concern of the present project. It proves to be theoretically significant and practically difficult. The only natural unit boundaries are speaker turns (a turn being defined by the presence of the speaker's phonation). Other units, whether larger or smaller than turns, tend to be theoretical constructs.
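As an illustration of a turn defined by the presence of a speaker's phonation, the sketch below merges consecutive stretches of phonation by the same speaker into one turn. It is not the project's procedure: it assumes the phonation intervals have already been labelled by speaker and timed, it does not handle overlapping speech, and the `Interval` type and `group_into_turns` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """A stretch of phonation by one speaker (times in seconds)."""
    speaker: str
    start: float
    end: float


def group_into_turns(intervals):
    """Merge consecutive phonation intervals by the same speaker into turns.

    A new turn starts whenever the speaker changes. Overlapping speech is
    not treated specially in this sketch.
    """
    turns = []
    for iv in sorted(intervals, key=lambda i: i.start):
        if turns and turns[-1].speaker == iv.speaker:
            last = turns[-1]
            turns[-1] = Interval(last.speaker, last.start, max(last.end, iv.end))
        else:
            turns.append(Interval(iv.speaker, iv.start, iv.end))
    return turns


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    sample = [
        Interval("X", 0.0, 1.2),
        Interval("X", 1.5, 2.0),
        Interval("Y", 2.1, 3.0),
        Interval("X", 3.2, 4.0),
    ]
    for turn in group_into_turns(sample):
        print(turn)
```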

Basic unit ---?
Acoustically speaking, a spontaneous talk is a sequence of strings of sounds uttered by two or more speakers. Prosodic or intonational units seem to be natural segments of the sequence. They are treated as basic units of talk and seem to have the same status as the sentence does in written text. The weaknesses of such segmentation are (1) segments larger than intonational units are assumed to be the mere stacking of these basic units, which is untrue, hence misleading; and (2) talk is treated as a self-contained product waiting to be sliced into intonational units, thus ignoring the dynamic aspect of talk and its intrinsic relation with the social activities at large.

Multiple level segmentation 1
The first-level segment: the activity boundary (segmenting talk from other social activities)
• Schedule boundary, e.g. a two-hour meeting, classroom discourse
• Visit boundary, e.g. a patient's visit to a doctor
• Case boundary, e.g. an accident settlement
• Appointment boundary, e.g.
• Business boundary, e.g. buy something

Multiple level segmentation 2
The second-level segment: goal-oriented segmentation (segmenting talk into goal-attaining chunks)
• The segmentation is made on the basis of the goal-attaining process: the goal-attainment structure
• E.g., opening, negotiating, closing of a meeting
• E.g., examine-diagnose-prescribe-recommend
• The presentation of a speaker

Multiple level segmentation 3
The third-level segment: turn-oriented segmentation (segmenting goal-attaining chunks into turn-taking chunks)
• The segmentation is made on the basis of the turn boundary

Multiple level segmentation 4
The fourth-level segment: functional units (segmenting turn-taking chunks into functional units)
The segmentation is made on the basis of functional markers or clues.
• A meaningful cluster with a clear forward function
• A meaningful cluster with a clear backward function
• A meaningful cluster with a clear downward function
• A meaningful cluster having a clear cognitive function: planning or searching for words

Multiple level segmentation 5
The fifth-level segment: linear character and phonetic units
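The five levels nest naturally, which suits the XML mark-up named in the encoding process. The sketch below shows one possible nesting using Python's standard `xml.etree.ElementTree`. The element and attribute names (`activity`, `goal_stage`, `turn`, `functional_unit`, `unit`) are hypothetical and are not the project's actual tag set; the text content is a placeholder.

```python
import xml.etree.ElementTree as ET


def build_sample():
    """Build one illustrative five-level segment tree (tag names are made up)."""
    # Level 1: activity boundary, here a patient's visit to a doctor.
    activity = ET.Element("activity", {"boundary": "visit"})
    # Level 2: goal-attaining chunk within the activity.
    stage = ET.SubElement(activity, "goal_stage", {"name": "examine"})
    # Level 3: a speaker turn within the chunk.
    turn = ET.SubElement(stage, "turn", {"speaker": "doctor"})
    # Level 4: a functional unit within the turn.
    functional = ET.SubElement(turn, "functional_unit", {"function": "forward"})
    # Level 5: linear character or phonetic units.
    unit = ET.SubElement(functional, "unit", {"type": "character"})
    unit.text = "(characters)"  # placeholder, not corpus data
    return activity


def nesting_depth(elem):
    """Return the number of levels from this element down to its deepest leaf."""
    return 1 + max((nesting_depth(child) for child in elem), default=0)


if __name__ == "__main__":
    root = build_sample()
    print(ET.tostring(root, encoding="unicode"))
    print("levels:", nesting_depth(root))
```

Nesting the levels this way keeps each smaller unit attached to the activity and goal stage it belongs to, rather than treating larger segments as a mere stack of basic units.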

Natural growth and development of language
[Figure: repeated "Trajectories of life path" elements, grouped with "Internalized language out of life path trajectories".]

Linguistic theory: as reconstruction, as modeling, as description, as standardization
[Figure: repeated "Trajectories of life path" elements, grouped with "Internalized language out of life path trajectories".]


				