AFRICAN SPEECH TECHNOLOGY



TECHNICAL REPORT 26









Perl Scripts for AST Database Validation,

Quality Control, and Manipulation.









November 2003









DACST Innovation Fund Project 21213

Consortium Members: University of Stellenbosch, University of Pretoria, West Technology Holdings Pty Ltd,

South African Foundation for Language and Speech Technology Development



Research Unit for Experimental Phonology, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa.

E-mail: (Administrative) jcr@maties.sun.ac.za (Technical): botha@sun.ac.za / dupreez@dsp.sun.ac.za

Tel +27 (0)21 8082106 Fax +27 (0)21 8083975 http://www.ast.sun.ac.za

AST Confidential Page 2 of 78









Identification number: DACST 2193 (AST) T26
Type: Technical Report
Title: Perl Scripts for AST Database Validation, Quality Control, and Manipulation.
Status: Final
Date: November 2003
Version: 1.0
Number of pages: 78
Author(s): M.W. Theunissen, mtheunis@sun.ac.za
Project co-ordinator: Justus Roux
e-mail: jcr@maties.sun.ac.za
http://www.ast.sun.ac.za
Access: Confidential
Key words:
Abstract: This document describes the Perl scripts that were developed for AST database validation, quality control, and manipulation. Each script is described in turn, followed by an explanation of how to use the scripts in conjunction with each other.
Actual Distribution:
Supplementary notes:







DOCUMENT EVOLUTION



Version    Date             Status    Notes
1.0        November 2003    Final









Language Team Perl Scripts for AST Database Validation, Quality Control, and Manipulation




Contents



1. Introduction
2. Description of AST Scripts
2.1 Assimilations.pl
2.1.1 Overview
2.1.2 Command Line Options
2.1.3 Questions
2.1.4 Generated Output Files
2.2 BottomToTop.pl
2.2.1 Overview
2.2.2 Command Line Options
2.2.3 Questions
2.2.4 Generated Output Files
2.3 BuildLex.pl
2.3.1 Overview
2.3.2 Command Line Options
2.3.3 Questions
2.3.4 Files Needed in Startup Directory
2.3.5 Generated Output Files
2.4 CheckAndReplaceOrtNames.pl
2.4.1 Overview
2.4.2 Command Line Options
2.4.3 Generated Output Files
2.5 CheckForErrors.pl
2.5.1 Overview
2.5.2 Command Line Options
2.5.3 Questions
2.5.4 Files Needed in Startup Directory
2.5.5 Generated Output Files
2.6 CheckForInvalidIntervals.pl
2.6.1 Overview
2.6.2 Command Line Options
2.6.3 Generated Output Files
2.7 CheckLexicon.pl
2.7.1 Overview
2.7.2 Questions
2.7.3 Files Needed in Startup Directory
2.7.4 Generated Output Files
2.8 CheckNumberOfALawsAndTextGrids.pl
2.8.1 Overview
2.8.2 Command Line Options
2.8.3 Generated Output Files
2.9 Cleanup.pl
2.9.1 Overview
2.9.2 Command Line Options
2.9.3 Questions
2.9.4 Generated Output Files
2.10 CompareAlawAndTextGridFilenames.pl
2.10.1 Overview
2.10.2 Command Line Options
2.10.3 Generated Output Files
2.11 Converter.pl
2.11.1 Overview
2.11.2 Command Line Options
2.11.3 Questions
2.11.4 Files Needed in Startup Directory
2.11.5 Generated Output Files
2.12 CopyAssimErrorFilesToFWE.pl
2.12.1 Overview
2.12.2 Command Line Options
2.12.3 Questions
2.13 CopyPhoneCorrectedFiles.pl
2.13.1 Overview
2.13.2 Command Line Options
2.13.3 Questions
2.14 CopySNBDErrorFilesToFWE.pl
2.14.1 Overview
2.14.2 Command Line Options
2.14.3 Questions
2.15 DeleteRedundantALTFilesLookingAtAttribs.pl
2.15.1 Overview
2.15.2 Command Line Options
2.16 ExtractTranscriptions.pl
2.16.1 Overview
2.16.2 Command Line Options
2.16.3 Questions
2.16.4 Generated Output Files
2.17 GenerateLabFiles.pl
2.17.1 Overview
2.17.2 Command Line Options
2.17.3 Files Needed in Startup Directory
2.17.4 Generated Output Files
2.18 GenOrt.pl
2.18.1 Overview
2.18.2 Command Line Options
2.18.3 Questions
2.18.4 Generated Output Files
2.19 GetPronouncedAcronyms.pl
2.19.1 Overview
2.19.2 Command Line Options
2.19.3 Questions
2.19.4 Generated Output Files
2.20 GetTextGridDirectoriesRecursively.pl
2.20.1 Overview
2.20.2 Command Line Options
2.20.3 Generated Output Files
2.21 Lin2Dos.pl
2.21.1 Overview
2.21.2 Command Line Options
2.22 MoveDuplicateAttribs.pl
2.22.1 Overview
2.22.2 Command Line Options
2.22.3 Questions
2.23 MovePhoneticTextGrids.pl
2.23.1 Overview
2.23.2 Command Line Options
2.24 NumberOfMalesAndFemales.pl
2.24.1 Overview
2.24.2 Command Line Options
2.25 ProcessRawTranscriptionData.pl
2.25.1 Overview
2.25.2 Command Line Options
2.25.3 Questions
2.25.4 Generated Output Files
2.26 RemoveDuplicateAttrib.pl
2.26.1 Overview
2.26.2 Command Line Options
2.26.3 Questions
2.27 RemoveInvalidIntervals.pl
2.27.1 Overview
2.27.2 Command Line Options
2.27.3 Generated Output Files
2.28 RemoveTextGridFilesWithErrors.pl
2.28.1 Overview
2.28.2 Command Line Options
2.28.3 Questions
2.29 Renamer.pl
2.29.1 Overview
2.29.2 Command Line Options
2.29.3 Questions
2.30 ReplaceAssimilations.pl
2.30.1 Overview
2.30.2 Command Line Options
2.30.3 Questions
2.30.4 Generated Output Files
2.31 Rip.pl
2.31.1 Overview
2.31.2 Command Line Options
2.31.3 Questions
2.31.4 Files Needed in Startup Directory
2.31.5 Generated Output Files
2.32 SubstitutePhonCharacters.pl
2.32.1 Overview
2.32.2 Command Line Options
2.32.3 Files Needed in Startup Directory
2.32.4 Generated Output Files
2.33 Transcribe.pl
2.33.1 Overview
2.33.2 Command Line Options
2.33.3 Questions
2.33.4 Files Needed in Startup Directory
2.33.5 Generated Output Files
2.34 WordsWithInternalZeros.pl
2.34.1 Overview
2.34.2 Command Line Options
2.34.3 Questions
2.34.4 Generated Output Files
2.35 WorkThroughFWEErrorFiles.pl
2.35.1 Overview
2.35.2 Command Line Options
2.35.3 Questions
2.35.4 Programs to Install
2.35.5 Files Needed during Startup
2.35.6 Generated Output Files
2.36 WorkThroughLex.pl
2.36.1 Overview
2.36.2 Questions
2.36.3 Programs to Install
2.36.4 Files Needed in Startup Directory
2.36.5 Generated Output Files
3. Using the Scripts
3.1 Processing Raw Orthographic Transcription Files
3.2 Generating Deterministic Phonetic Transcriptions
3.3 Processing Phonetically Corrected Data
3.4 Merging Phonetically Corrected Batches of Data
3.5 Correcting More Errors
3.5.1 Sil/Nonsil/Boundary/Duration Errors
3.5.2 Assimilation Errors
3.5.3 Lexicon Errors
3.6 Final Processing of Information
Appendix A – PhoneSets.lst
Appendix B – Sesotho Transcription Rules
Appendix C – Xhosa Transcription Rules
Appendix D – Zulu Transcription Rules








Acronyms



ALT Alaw/Lab/TextGrid

AST African Speech Technology

CFE CheckForErrors.pl

CIS Case Insensitive Search

CSS Case Sensitive Search

FWE FilesWithErrors

SNB Sil/Nonsil/Boundary

SNBD Sil/Nonsil/Boundary/Duration












1. INTRODUCTION



In this report, a short description will be given of each of the Perl scripts that were used for

AST database validation, quality control, and manipulation. In addition to this, a list will be

given showing the order in which scripts should be run.





2. DESCRIPTION OF AST SCRIPTS



2.1 Assimilations.pl



2.1.1 Overview



Extracts assimilations (#1..#2..#3) from orthographic transcriptions. TextGrid files are

allowed to contain only orthographic transcriptions, or orthographic and phonetic

transcriptions.



2.1.2 Command Line Options



1) Assimilations.pl



Will extract assimilations from the TextGrid files that can be found in the directories

which are specified in the list file DirectoryList.lst.



2) Assimilations.pl -f



Will only extract assimilations from the specified TextGrid file.



3) Assimilations.pl -d



Will only extract assimilations from the specified directory's TextGrid files.



4) Assimilations.pl -l



The newly specified list file will be used instead of DirectoryList.lst.
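The option handling described above can be sketched as follows. This is a minimal Python illustration and not the Perl source of Assimilations.pl. It assumes that -f, -d and -l each take one argument (the file, directory, or list file name), that -f takes precedence over -d and -d over -l, and that a list file holds one directory per line. None of these details is stated in this report, and all function names are hypothetical.

```python
import argparse
from pathlib import Path


def parse_options(argv):
    """Parse the -f / -d / -l options (argument layout is an assumption)."""
    parser = argparse.ArgumentParser(prog="Assimilations")
    parser.add_argument("-f", dest="file", help="single TextGrid file")
    parser.add_argument("-d", dest="directory", help="single directory")
    parser.add_argument("-l", dest="list_file", help="alternative list file")
    return parser.parse_args(argv)


def read_directory_list(text):
    """Assumed list-file format: one directory per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_targets(options, read_text=lambda path: Path(path).read_text()):
    """Return ("file", [path]) or ("directories", [paths])."""
    if options.file:
        return "file", [options.file]
    if options.directory:
        return "directories", [options.directory]
    # No -f or -d: fall back to the list file, DirectoryList.lst by default.
    list_file = options.list_file or "DirectoryList.lst"
    return "directories", read_directory_list(read_text(list_file))
```

The `read_text` parameter exists only so that the sketch can be exercised without a real list file on disk.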



2.1.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors (see footnote 1). Otherwise, answer “n” to abort the program.



2.1.4 Generated Output Files



If one or more errors are found in the TextGrids, the following file will be created in the

original startup directory before program execution is stopped:







Footnote 1: Normal CFE errors exclude a) utterance and sentence marker location errors,
b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.






• DirectoriesWithErrors.txt



Holds the names of all the directories in which TextGrid files with errors were found.

Note that this file will only be created if working in one or more directories.



In addition, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory to hold the TextGrid files
containing the errors. Their alaw files (located in the working directory or the alaw
directory) will be copied to the FWE subdirectory as well. A FilesWithErrors.txt file will
also be created in the FWE subdirectory and will contain an error message for each
affected TextGrid file. To work through these error files, use
WorkThroughFWEErrorFiles.pl (Section 2.35).
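The FWE bookkeeping described above can be sketched as follows. This Python illustration is not the scripts' actual implementation. It assumes that the TextGrid files are copied rather than moved, that an alaw file shares the base name of its TextGrid file and carries the extension ".alaw", that only the working directory is searched for alaw files, and that FilesWithErrors.txt holds one "name: message" line per file. The function name and the report line format are hypothetical.

```python
import shutil
from pathlib import Path


def collect_files_with_errors(directory, errors, alaw_suffix=".alaw"):
    """Copy faulty TextGrid files and their audio into a FilesWithErrors folder.

    `errors` maps a TextGrid file name to its error message.
    Returns the FWE path, or None when there is nothing to report.
    """
    directory = Path(directory)
    if not errors:
        return None
    fwe = directory / "FilesWithErrors"
    fwe.mkdir(exist_ok=True)
    report_lines = []
    for name, message in sorted(errors.items()):
        textgrid = directory / name
        shutil.copy2(textgrid, fwe / name)
        # Assumed naming convention: same base name, ".alaw" extension.
        alaw = textgrid.with_suffix(alaw_suffix)
        if alaw.exists():
            shutil.copy2(alaw, fwe / alaw.name)
        report_lines.append(f"{name}: {message}")
    (fwe / "FilesWithErrors.txt").write_text("\n".join(report_lines) + "\n")
    return fwe
```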



If assimilations are found, the following files will be created in the original startup directory:



• Assimilations.txt



Holds the list of all the unique assimilations that were found in the TextGrid files.



• TextGridFilesContainingAssimilations.txt



Holds the list of all TextGrid files containing assimilations. In addition to this, the

assimilations associated with each TextGrid will be displayed. Note that this file will

only be created if looking for assimilations in one or more directories.





2.2 BottomToTop.pl



2.2.1 Overview



Orthographic intervals are modified in order to correspond with phonetic intervals. Only

those intervals falling outside utterance and sentence markers will be updated. Note that

the TextGrid files must contain orthographic and phonetic transcriptions.



2.2.2 Command Line Options



1) BottomToTop.pl



Will update the TextGrid files that can be found in the directories which are

specified in the list file DirectoryList.lst.



2) BottomToTop.pl -f



Will only update the specified TextGrid file.



3) BottomToTop.pl -d



Will only update the specified directory's TextGrid files.



4) BottomToTop.pl -l



The newly specified list file will be used instead of DirectoryList.lst.








2.2.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors (see footnote 2). Otherwise, answer “n” to abort the program.



2.2.4 Generated Output Files



If one or more errors are found in the TextGrids, the following file will be created in the

original startup directory before program execution is stopped:



• DirectoriesWithErrors.txt



Holds the names of all the directories in which TextGrid files with errors were found.

Note that this file will only be created if working in one or more directories.



In addition, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory to hold the TextGrid files
containing the errors. Their alaw files (located in the working directory or the alaw
directory) will be copied to the FWE subdirectory as well. A FilesWithErrors.txt file will
also be created in the FWE subdirectory and will contain an error message for each
affected TextGrid file. To work through these error files, use
WorkThroughFWEErrorFiles.pl (Section 2.35).



If one or more directories were modified the following file will be created in the original

startup directory:



• DirectoriesWithUpdates.txt



Holds the names of all the directories in which TextGrid files were updated. Note

that this file will only be created if working in one or more directories.





2.3 BuildLex.pl



2.3.1 Overview



Builds a VAR lexicon (see footnote 3) from the orthographic and phonetic information
contained in a batch of TextGrid files. The TextGrids must therefore contain both
orthographic and phonetic transcriptions.



2.3.2 Command Line Options



1) BuildLex.pl



Will build the lexicon from the TextGrid files that can be found in the directories which
are specified in the list file DirectoryList.lst.



Footnote 2: Normal CFE errors exclude a) utterance and sentence marker location errors,
b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.

Footnote 3: A VAR lexicon can contain several phonetic sequences for each orthographic
word. The phonetic sequences of an orthographic word are separated by commas.








2) BuildLex.pl -d



Will only build the lexicon from the specified directory's TextGrid files.



3) BuildLex.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.3.3 Questions



1) For this script to work properly, the data must be free of ALL CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of all CFE (CheckForErrors.pl) errors
(see footnote 4). Otherwise, answer “n” to abort the program.



2) For this script to work properly, the data must be in XSAMPA format.

Do you want to continue? (y/n)



Answer “y” to this question if the data is in XSAMPA format. Otherwise, answer “n”
to abort the program. If the phonetic transcriptions are in Praat format, use
Converter.pl (Section 2.11) to convert them to XSAMPA.



3) Please enter the name of the VAR lexicon.



This file will hold the extracted VAR lexicon. An example of a section of such a

lexicon is shown below:



.

.

.

but=[b V t,b a t,b V,b @,b @ t]

but;its=[b V 4 I t s]

but;the=[b V t @,b V d @,b V D @,b @ t @,b a d @,b @ d @,b V D I,b V t I,b a t @,b V t @ ?,b V d E:,b V t 3:,b V t ?,b V 4 @,b V d I,b V t]

but;the;*elen+=[b a d 9 l I n]

but;the;elementary=[b V d @ { l @ m E n t R\ I,b V D { l @ m E n t r\ i:,b V D @ l @ m E n t r\ I,b V D E l @ m E n t r\ i,b V D E l @ m E n t @ r\ i:,b V d @ l @ m E n t r\ I,b D E: l @ m E n t r\ y,b a D E l @ m E n t r i:,b a D E l @ m E n t r\ i,b V d E: l @ m @ n t r\ i:]

butterfly=[b V t f V-\I,b V t @ f l V-\I]

butterscotch=[b a t @ r s k Q t-\S,b a t @ s k Q t-\S]

buy=[b a-\i]

buying=[b a-\i j I N]

by=[b V-\I,b a-\i,b @-\i,b a-\I]

by;a=[b V-\I]

ca=[k { ?]





Footnote 4: All CFE errors include a) utterance and sentence marker location errors,
b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.








cab;broke=[k {: b r\ @-\u k]

cafeteria=[k { f @ t I-\@ r\ I j @]

calf=[k A: f]

call=[k O: l,k O:,k O l]

.

.

.





Note that the phones are separated by spaces and that commas are used to

separate the different phonetic representations of an orthographic word. An

orthographic word’s most frequent phonetic sequence will occur first in the list, while

the least frequent one will be written last.



The extracted lexicon can be checked using CheckLexicon.pl (Section 2.7).
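The ordering rule described above (phones separated by spaces, variants separated by commas, most frequent variant first) can be sketched as follows. This Python illustration is not the BuildLex.pl source. The input is assumed to be a list of (orthographic word, list of phones) pairs already extracted from aligned TextGrid files, entries are assumed to be sorted alphabetically by word, and ties in frequency are resolved by first occurrence, which the report does not specify. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_var_lexicon(observations):
    """Group phone sequences per orthographic word, most frequent first.

    `observations` is an iterable of (word, [phone, ...]) pairs.
    Returns {word: [(phone_sequence, count), ...]}.
    """
    counts = defaultdict(Counter)
    for word, phones in observations:
        counts[word][" ".join(phones)] += 1
    lexicon = {}
    for word in sorted(counts):
        # most_common() orders by descending count; equal counts keep
        # the order in which the sequences were first seen.
        lexicon[word] = counts[word].most_common()
    return lexicon


def format_var_entry(word, variants):
    """Render one VAR lexicon line, e.g. 'but=[b V t,b a t]'."""
    return f"{word}=[{','.join(seq for seq, _ in variants)}]"


def format_counts_entry(word, variants):
    """Render the matching counts lexicon line, e.g. 'but=[2,1]'."""
    return f"{word}=[{','.join(str(n) for _, n in variants)}]"
```

Keeping the counts next to each sequence means that the VAR lexicon and the counts lexicon (Question 5) can be written from the same structure, which guarantees that their positions correspond.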



4) Please enter the name of the a priori lexicon.



This file will hold the a priori probabilities of each phonetic sequence in the lexicon.

The data will be saved in the same structure used to store the VAR lexicon.

Therefore, the a priori probabilities will be written to the locations where the

phonetic sequences would normally have stood. For the section of the extracted AE

VAR lexicon above, the a priori lexicon looks as follows:



.

.

.

but=[0.00190174326465927,0.00135587251276633,0.000792393026941363,0.000158478605388273,0.000158478605388273]

but;its=[1.76087339320303e-005]

but;the=[0.000563479485824969,0.000158478605388273,0.000105652403592182,7.04349357281212e-005,5.28262017960909e-005,3.52174678640606e-005,3.52174678640606e-005,3.52174678640606e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]

but;the;*elen+=[1.76087339320303e-005]

but;the;elementary=[1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]

butterfly=[1.76087339320303e-005,1.76087339320303e-005]

butterscotch=[3.52174678640606e-005,1.76087339320303e-005]

buy=[1.76087339320303e-005]

buying=[1.76087339320303e-005]

by=[0.000281739742912485,0.000123261137524212,5.28262017960909e-005,3.52174678640606e-005]

by;a=[1.76087339320303e-005]

ca=[1.76087339320303e-005]












cab=[1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]

cab;broke=[1.76087339320303e-005]

cafeteria=[1.76087339320303e-005]

calf=[1.76087339320303e-005]

call=[0.000669131889417151,8.80436696601514e-005,1.76087339320303e-005]

.

.

.





Note that this file cannot be checked with CheckLexicon.pl.



5) Please enter the name of the counts lexicon.



This file will hold the phonetic sequence counts - the number of times each phonetic

sequence occurs in the TextGrid files. The data will be saved in the same structure

used to store the VAR lexicon. Therefore, the counts will be written to the locations

where the phonetic sequences would normally have stood. For the section of the

extracted AE VAR lexicon above, the counts lexicon looks as follows:



.

.

.

but=[108,77,45,9,9]

but;its=[1]

but;the=[32,9,6,4,3,2,2,2,1,1,1,1,1,1,1,1]

but;the;*elen+=[1]

but;the;elementary=[1,1,1,1,1,1,1,1,1,1]

butterfly=[1,1]

butterscotch=[2,1]

buy=[1]

buying=[1]

by=[16,7,3,2]

by;a=[1]

ca=[1]

cab=[1,1,1]

cab;broke=[1]

cafeteria=[1]

calf=[1]

call=[38,5,1]

.

.

.



Note that this file cannot be checked with CheckLexicon.pl.
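The relationship between the counts lexicon and the a priori lexicon can be illustrated with a short Python sketch. The report does not state the formula used. The sample values above are, however, consistent with each probability being the count divided by 56790 (for example, 108/56790 ≈ 0.0019017 and 1/56790 ≈ 1.7609e-005). The normalising total of 56790 is therefore inferred from the samples and should be treated as an assumption, as should the function names.

```python
def parse_bracket_entry(line):
    """Split 'word=[a,b,c]' into (word, ['a', 'b', 'c'])."""
    word, _, rest = line.strip().partition("=[")
    if not rest.endswith("]"):
        raise ValueError(f"malformed entry: {line!r}")
    return word, rest[:-1].split(",")


def a_priori_from_counts(count_lines, total):
    """Convert counts lexicon lines into per-variant probabilities.

    `total` is the normalising constant (assumed to be a corpus-wide
    total; its exact definition is not given in the report).
    """
    result = {}
    for line in count_lines:
        word, fields = parse_bracket_entry(line)
        result[word] = [int(field) / total for field in fields]
    return result


if __name__ == "__main__":
    sample = ["but=[108,77,45,9,9]", "call=[38,5,1]"]
    for word, probs in a_priori_from_counts(sample, 56790).items():
        print(word, probs)
```

Because the a priori and counts files reuse the VAR lexicon structure, the same `parse_bracket_entry` helper would read all three files. Only the interpretation of the comma-separated fields differs.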












6) Please enter the name of the file that will hold the location info.



This file will hold the information that will be needed to locate the TextGrid files that

are associated with each phonetic sequence in the VAR lexicon. For the section of

the extracted AE VAR lexicon above, the location information is shown below:



.

.

.

but bVt K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE003LM034.TextGrid AE005LM024.TextGrid AE006LF027.TextGrid

AE007LF034.TextGrid AE012CM040.TextGrid AE016CF024.TextGrid

AE017CF033.TextGrid AE018CF040.TextGrid AE024LM032.TextGrid

AE026CF022.TextGrid AE028CF032.TextGrid AE028LF032.TextGrid

AE035LM033.TextGrid AE042CM029.TextGrid AE046LF024.TextGrid

AE048LF021.TextGrid AE055CM021.TextGrid AE066LF039.TextGrid

AE067LF034.TextGrid AE068LF019.TextGrid AE070LF035.TextGrid

AE075CM026.TextGrid AE081LM025.TextGrid AE087LF040.TextGrid

AE088LF035.TextGrid AE089LF032.TextGrid AE094CM023.TextGrid

AE106LF025.TextGrid AE110CF039.TextGrid AE115CM021.TextGrid

AE120CF029.TextGrid AE126LF035.TextGrid AE128LF040.TextGrid

AE129LF039.TextGrid AE130LF027.TextGrid AE131CM030.TextGrid

AE132CM034.TextGrid AE133LM031.TextGrid AE134CM022.TextGrid

AE135CM025.TextGrid AE137CF024.TextGrid AE147LM021.TextGrid

AE154CM028.TextGrid AE158CF037.TextGrid AE161LM028.TextGrid

AE163LM022.TextGrid AE167LF040.TextGrid AE168LF039.TextGrid

AE169LF023.TextGrid AE173CM037.TextGrid AE174CM030.TextGrid

AE180CF028.TextGrid AE183LM040.TextGrid AE185LM030.TextGrid

AE187LF039.TextGrid AE189LF027.TextGrid AE190LF023.TextGrid

AE191CM029.TextGrid AE192CM038.TextGrid AE194CM037.TextGrid

AE196CF030.TextGrid AE200CF026.TextGrid AE203LF024.TextGrid

AE205LF027.TextGrid AE206LF039.TextGrid AE207LF023.TextGrid

AE212LF039.TextGrid AE214LM021.TextGrid AE216LF032.TextGrid

AE217LM027.TextGrid AE225LM035.TextGrid AE232LM033.TextGrid

AE251CF025.TextGrid AE253CF030.TextGrid AE255LM025.TextGrid

AE257LF025.TextGrid AE258LM031.TextGrid AE261CM039.TextGrid

AE263CF022.TextGrid AE265LF031.TextGrid AE266LM032.TextGrid

AE267LM025.TextGrid AE268LF029.TextGrid AE269CF028.TextGrid

AE270CF027.TextGrid AE277CM039.TextGrid AE283LM032.TextGrid

AE289LF023.TextGrid AE291CF027.TextGrid AE295LM023.TextGrid

AE300LF037.TextGrid AE312CF039.TextGrid AE314CM038.TextGrid

AE325LF037.TextGrid AE344LF031.TextGrid AE352LF034.TextGrid

AE359CF027.TextGrid AE365LF034.TextGrid AE373LF027.TextGrid

AE374LF032.TextGrid AE375CF029.TextGrid AE379LF022.TextGrid

AE381LF030.TextGrid AE384CF038.TextGrid AE385CF040.TextGrid

AE387LF026.TextGrid AE390LM022.TextGrid AE390LM030.TextGrid

but bat K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE004LM034.TextGrid AE015LM023.TextGrid AE018CF027.TextGrid

AE018CF035.TextGrid AE019CF037.TextGrid AE021LM028.TextGrid

AE023LM032.TextGrid AE032LM025.TextGrid AE036LF027.TextGrid












AE039LF026.TextGrid AE048LF021.TextGrid AE048LF025.TextGrid

AE054CM031.TextGrid AE069LF038.TextGrid AE077CF035.TextGrid

AE081LM035.TextGrid AE082LM035.TextGrid AE083LM035.TextGrid

AE091CM036.TextGrid AE092CM023.TextGrid AE093CM034.TextGrid

AE096CF030.TextGrid AE097LF033.TextGrid AE098LF032.TextGrid

AE100LF021.TextGrid AE103LM027.TextGrid AE104LM034.TextGrid

AE105LM036.TextGrid AE108LF036.TextGrid AE109LF038.TextGrid

AE111CM036.TextGrid AE112CM035.TextGrid AE113CM038.TextGrid

AE115CM020.TextGrid AE116CF038.TextGrid AE119CF040.TextGrid

AE121LM035.TextGrid AE122LM034.TextGrid AE124LM024.TextGrid

AE125LM028.TextGrid AE127LF028.TextGrid AE133LM027.TextGrid

AE138LF039.TextGrid AE139CF025.TextGrid AE141LM034.TextGrid

AE144LM032.TextGrid AE145LM035.TextGrid AE150LF031.TextGrid

AE152CM026.TextGrid AE155LM038.TextGrid AE157CF040.TextGrid

AE160LF021.TextGrid AE160LF025.TextGrid AE172CM038.TextGrid

AE178CF031.TextGrid AE181LM027.TextGrid AE184LM039.TextGrid

AE186LF031.TextGrid AE198CF030.TextGrid AE199LF040.TextGrid

AE204LM033.TextGrid AE219CM023.TextGrid AE226LM022.TextGrid

AE230CF039.TextGrid AE241LF038.TextGrid AE252CF038.TextGrid

AE264CF021.TextGrid AE332LM032.TextGrid AE344LF019.TextGrid

AE344LF022.TextGrid AE351LF021.TextGrid AE351LF032.TextGrid

AE372LM030.TextGrid AE372LM031.TextGrid AE377LF027.TextGrid

AE386CM029.TextGrid AE393LM030.TextGrid

but bV K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE029LF035.TextGrid AE031CF028.TextGrid AE033CM027.TextGrid

AE038CF038.TextGrid AE041LM022.TextGrid AE043LM022.TextGrid

AE044LM039.TextGrid AE049LF033.TextGrid AE050LF032.TextGrid

AE051CM038.TextGrid AE052CM034.TextGrid AE053CM033.TextGrid

AE058LF029.TextGrid AE060CF021.TextGrid AE063LM029.TextGrid

AE065LM029.TextGrid AE068LF021.TextGrid AE101LM035.TextGrid

AE162LM022.TextGrid AE179CF029.TextGrid AE208LF021.TextGrid

AE211CM031.TextGrid AE218CF040.TextGrid AE221LM036.TextGrid

AE223LF025.TextGrid AE227CM034.TextGrid AE231LF036.TextGrid

AE284LM025.TextGrid AE288CM040.TextGrid AE290LM037.TextGrid

AE311CM035.TextGrid AE314CM039.TextGrid AE314LF023.TextGrid

AE315CM023.TextGrid AE320CM022.TextGrid AE323LF028.TextGrid

AE324CF034.TextGrid AE331CM027.TextGrid AE343CM037.TextGrid

AE346LF034.TextGrid AE348LF036.TextGrid AE353LF036.TextGrid

AE361LF030.TextGrid AE366LF034.TextGrid AE376CM035.TextGrid

but b@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE020CF025.TextGrid AE047LF025.TextGrid AE056CF021.TextGrid

AE071CM029.TextGrid AE153CF023.TextGrid AE210CF031.TextGrid

AE213CM026.TextGrid AE318LM028.TextGrid AE330CF021.TextGrid

but b@t K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE002LM039.TextGrid AE045LM039.TextGrid AE170LF021.TextGrid

AE195CM037.TextGrid AE215CF034.TextGrid AE216LF022.TextGrid

AE259LM021.TextGrid AE345LM032.TextGrid AE378LF036.TextGrid

but;its bV4Its K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE293LM031.TextGrid

but;the bVt@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE008CF028.TextGrid AE013LM024.TextGrid AE024LM033.TextGrid








AE056CF024.TextGrid AE059CF036.TextGrid AE072CM023.TextGrid

AE080CF034.TextGrid AE099CF029.TextGrid AE148LF035.TextGrid

AE166LF033.TextGrid AE175CM028.TextGrid AE176CF036.TextGrid

AE188LF029.TextGrid AE222LF036.TextGrid AE268LF040.TextGrid

AE275CF023.TextGrid AE276CM024.TextGrid AE281CF020.TextGrid

AE281CF021.TextGrid AE285LF023.TextGrid AE292LM037.TextGrid

AE296LM035.TextGrid AE297LF025.TextGrid AE329LF033.TextGrid

AE338CM036.TextGrid AE347LM027.TextGrid AE350LF023.TextGrid

AE355LF028.TextGrid AE362CF036.TextGrid AE371LF024.TextGrid

AE380LF027.TextGrid AE389LM025.TextGrid

but;the bVd@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE062LM022.TextGrid AE073CM022.TextGrid AE076CF022.TextGrid

AE151CM038.TextGrid AE220LM021.TextGrid AE271LM026.TextGrid

AE289LF038.TextGrid AE299LF034.TextGrid AE340LF028.TextGrid

but;the bVD@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE029CF035.TextGrid AE074CM034.TextGrid AE182LM034.TextGrid

AE205LF033.TextGrid AE298LM026.TextGrid AE391LF034.TextGrid

but;the b@t@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE156CF022.TextGrid AE273LF029.TextGrid AE287LM040.TextGrid

AE388LM029.TextGrid

but;the bad@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE102LM025.TextGrid AE107LF022.TextGrid AE123LM040.TextGrid

but;the b@d@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE164LM033.TextGrid AE333LF036.TextGrid

but;the bVDI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE034LM026.TextGrid AE294LF025.TextGrid

but;the bVtI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE317LF038.TextGrid AE392LF032.TextGrid

but;the bat@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE145CM035.TextGrid

but;the bVt@? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE354LF040.TextGrid

but;the bVdE: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE142LF029.TextGrid

but;the bVt3: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE313LM028.TextGrid

but;the bVt? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE009LF024.TextGrid

but;the bV4@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE159CF029.TextGrid

but;the bVdI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE293LM029.TextGrid

but;the bVt K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE171CM029.TextGrid

but;the;*elen+ bad9lIn K:\Phon_corrected_FLE\AE_phon_corrected_final\merge

% AE165LM033.TextGrid

but;the;elementary bVd@{l@mEntR\I

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE319LF030.TextGrid

but;the;elementary bVD{l@mEntr\i:

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %








AE064LM035.TextGrid

but;the;elementary bVD@l@mEntr\I

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE342CM032.TextGrid

but;the;elementary bVDEl@mEntr\i

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE379LF021.TextGrid

but;the;elementary bVDEl@mEnt@r\i:

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE128LF038.TextGrid

but;the;elementary bVd@l@mEntr\I

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE394LM024.TextGrid

but;the;elementary bDE:l@mEntr\y

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE193CM035.TextGrid

but;the;elementary baDEl@mEntri:

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE143LM031.TextGrid

but;the;elementary baDEl@mEntr\i

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE001LM023.TextGrid

but;the;elementary bVdE:l@m@ntr\i:

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE356LM023.TextGrid

butterfly bVtfV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE003LM026.TextGrid

butterfly bVt@flV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE153CF029.TextGrid

butterscotch bat@rskQt-\S

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE028CF021.TextGrid AE028LF021.TextGrid

butterscotch bat@skQt-\S

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE028LF020.TextGrid

buy ba-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE215CF029.TextGrid

buying ba-\ijIN K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE206LF031.TextGrid

by bV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE007LF037.TextGrid AE045LM026.TextGrid AE052CM030.TextGrid

AE142LF037.TextGrid AE158CF032.TextGrid AE166LF040.TextGrid

AE208LF033.TextGrid AE220LM023.TextGrid AE263CF029.TextGrid

AE270CF024.TextGrid AE276CM033.TextGrid AE285LF022.TextGrid

AE312CF037.TextGrid AE343CM030.TextGrid AE350LF040.TextGrid

AE359CF023.TextGrid

by ba-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE001LM039.TextGrid AE116CF035.TextGrid AE150LF039.TextGrid

AE165LM030.TextGrid AE186LF038.TextGrid AE365LF021.TextGrid

by b@-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE198CF014.TextGrid AE198CF040.TextGrid AE200CF015.TextGrid








by ba-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE032LM038.TextGrid AE092CM027.TextGrid

by;a bV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE361LF021.TextGrid

ca k{? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE136CF032.TextGrid

cab k{:b K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE185LM021.TextGrid

cab k{ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE153CF037.TextGrid

cab k{b K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE141LM038.TextGrid

cab;broke k{:br\@-\uk K:\Phon_corrected_FLE\AE_phon_corrected_final\merge

% AE168LF026.TextGrid

cafeteria k{f@tI-\@r\Ij@

K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE104LM022.TextGrid

calf kA:f K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE185LM040.TextGrid

call kO:l K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE013LM036.TextGrid AE019CF034.TextGrid AE029CF024.TextGrid

AE029LF024.TextGrid AE031CF029.TextGrid AE036LF022.TextGrid

AE045LM034.TextGrid AE047LF028.TextGrid AE060CF034.TextGrid

AE075CM030.TextGrid AE077CF026.TextGrid AE079CF039.TextGrid

AE091CM038.TextGrid AE103LM026.TextGrid AE106LF032.TextGrid

AE130LF028.TextGrid AE135CM037.TextGrid AE144LM039.TextGrid

AE147LM040.TextGrid AE154CM022.TextGrid AE156CF028.TextGrid

AE168LF030.TextGrid AE169LF022.TextGrid AE171CM035.TextGrid

AE176CF031.TextGrid AE182LM038.TextGrid AE204LM035.TextGrid

AE223LF029.TextGrid AE230CF031.TextGrid AE275CF025.TextGrid

AE276CM025.TextGrid AE281CF036.TextGrid AE283LM026.TextGrid

AE329LF026.TextGrid AE333LF025.TextGrid AE359CF025.TextGrid

AE361LF025.TextGrid AE380LF040.TextGrid

call kO: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE148LF031.TextGrid AE151CM023.TextGrid AE181LM035.TextGrid

AE287LM028.TextGrid AE371LF030.TextGrid

call kOl K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %

AE024LM027.TextGrid

.

.

.





Each entry/line has the following format:



ort_word phon_sequence directory % TextGrids



Given a lexicon entry, a script can therefore easily locate its associated directory and
TextGrid files. The percentage sign separates the directory from the list of TextGrid
files and was chosen in order to simplify scripting.
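A minimal Python sketch of reading one location entry in the format shown above follows. It assumes that an entry has already been joined into a single string (in the listing above, long entries wrap over several lines), that the orthographic word and the phonetic sequence contain no spaces, and that a stand-alone "%" token occurs only as the separator. The class and function names are hypothetical.

```python
from typing import List, NamedTuple


class LocationEntry(NamedTuple):
    ort_word: str
    phon_sequence: str
    directory: str
    textgrids: List[str]


def parse_location_entry(entry):
    """Parse 'ort_word phon_sequence directory % TextGrids'."""
    fields = entry.split()
    if "%" not in fields or fields.index("%") < 3:
        raise ValueError(f"malformed location entry: {entry!r}")
    sep = fields.index("%")
    # Everything between the phonetic sequence and '%' is the directory;
    # runs of whitespace inside a directory name would be collapsed.
    directory = " ".join(fields[2:sep])
    return LocationEntry(fields[0], fields[1], directory, fields[sep + 1:])


def index_locations(entries):
    """Map (ort_word, phon_sequence) to the parsed entry for quick lookup."""
    parsed = (parse_location_entry(entry) for entry in entries)
    return {(e.ort_word, e.phon_sequence): e for e in parsed}
```

With such an index, the TextGrid files behind a suspicious lexicon variant can be looked up directly from the word and its phonetic sequence.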








7) Select which words should be added to the lexicon:

1. All the orthographic words.

2. Only those words standing between dollar markers.

3. Only those words that are not standing between dollar markers.



Here the user can specify which words should be included in the lexicon.



2.3.4 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.3.5 Generated Output Files



If one or more errors are found in the TextGrids, the following file will be created in the

original startup directory before program execution is stopped:



• DirectoriesWithErrors.txt



Holds the names of all the directories in which TextGrid files with errors were found.

Note that this file will only be created if working in one or more directories.



In addition, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory to hold the TextGrid files
containing the errors. Their alaw files (located in the working directory or the alaw
directory) will be copied to the FWE subdirectory as well. A FilesWithErrors.txt file will
also be created in the FWE subdirectory and will contain an error message for each
affected TextGrid file. To work through these error files, use
WorkThroughFWEErrorFiles.pl (Section 2.35).



If the VAR lexicon contains entries, the following files (with the filenames specified by
the user) will be created in the original startup directory:



• The file holding the VAR lexicon.

• The file holding the a priori lexicon.

• The file holding the counts lexicon.

• The file holding the location information.





2.4 CheckAndReplaceOrtNames.pl



2.4.1 Overview



Checks the name of the orthographic transcription in the TextGrid files. If it's not the

desired name it gets replaced with the correct one, namely “Orthographic”. TextGrid files

are allowed to contain only orthographic transcriptions, or orthographic and phonetic

transcriptions.
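The check-and-replace step can be sketched in Python as follows. This is an illustration under stated assumptions and not the script's actual logic: it assumes that the TextGrid is stored in Praat's long text format, in which each tier has a line of the form name = "...", and that the orthographic tier is the first tier in the file. The function name is hypothetical.

```python
import re

# First 'name = "..."' line at the start of a line (long TextGrid format).
NAME_LINE = re.compile(r'^([ \t]*name = )"([^"]*)"', re.MULTILINE)


def fix_ort_tier_name(textgrid_text, wanted="Orthographic"):
    """Return (new_text, changed) with the first tier renamed to `wanted`.

    Assumes the orthographic tier is the first tier in the file.
    """
    match = NAME_LINE.search(textgrid_text)
    if match is None or match.group(2) == wanted:
        return textgrid_text, False
    start, end = match.span()
    replacement = f'{match.group(1)}"{wanted}"'
    return textgrid_text[:start] + replacement + textgrid_text[end:], True
```

Returning a `changed` flag makes it straightforward to record which files and directories were updated.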










2.4.2 Command Line Options



1) CheckAndReplaceOrtNames.pl



The orthographic transcription names are checked and replaced (if necessary) in

the TextGrid files that can be found in the directories which are specified in the list

file DirectoryList.lst.



2) CheckAndReplaceOrtNames.pl -f



Will check and replace (if necessary) the specified TextGrid file's orthographic

transcription name.



3) CheckAndReplaceOrtNames.pl -d



Will check and replace (if necessary) the orthographic transcription names of
the specified directory's TextGrid files.



4) CheckAndReplaceOrtNames.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.4.3 Generated Output Files



See section 2.2.4.





2.5 CheckForErrors.pl



2.5.1 Overview



TextGrid files are checked for transcription errors. The files are allowed to contain only

orthographic transcriptions, or orthographic and phonetic transcriptions.



2.5.2 Command Line Options



1) CheckForErrors.pl



Will check the TextGrid files of the directories which are specified in the list file

DirectoryList.lst for errors.



2) CheckForErrors.pl -f



Will only check the specified TextGrid file for errors.



3) CheckForErrors.pl -d



Will only check the specified directory's TextGrid files for errors.



4) CheckForErrors.pl -l



The newly specified list file will be used instead of DirectoryList.lst.






2.5.3 Questions



1) Checking:

1. Nguni

2. Sesotho

3. Afrikaans

4. English



Select the language of the data that is going to be checked.



2) Does TextGrids contain ort and phon transcriptions (y/n)



If the TextGrid files contain orthographic and phonetic transcriptions answer “y”.

Otherwise, if the TextGrid files contain only orthographic transcriptions answer “n”.



3) Check utterance and sentence marker locations (y/n)



Answer “y” to this question in order to make sure that utterance and sentence
markers are used correctly with respect to interval boundaries, and that pauses
always exist between utterances and sentences.



4) Check alignment & phonetic symbols (y/n)



Answer “y” to check both the alignment between the orthographic and phonetic
transcriptions (every orthographic segment must have a corresponding phonetic
segment) and the phonetic symbols used in the phonetic transcription.



Note that this question will only be asked if the answer to Question 2 is “y”.



5) Check for identical ort and phon segments (y/n)



Answer “y” if the aligned transcriptions (ort & phon) must be checked for

orthographic words and phonetic sequences that are identical. These instances will

be treated as if they are actual errors, which of course is not necessarily the case.



Note that this question will only be asked if the answer to Question 4 is “y”.



6) Phon tier contains

1. XSAMPA

2. Praat



Select the phonetic transcription format.



Note that this question will only be asked if the answer to Question 4 is “y”.
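The dependencies between the questions above can be summarised in a short Python sketch. It encodes only what is stated in this section: Question 4 is asked only if Question 2 is answered “y”, and Questions 5 and 6 are asked only if Question 4 is answered “y”. It assumes that Questions 1 to 3 are always asked. The function name is hypothetical.

```python
def questions_asked(has_phon, check_alignment=False):
    """Return the numbers of the CheckForErrors.pl questions that are asked.

    has_phon        -- answer to Question 2 (ort and phon transcriptions?)
    check_alignment -- answer to Question 4 (only relevant if has_phon)
    """
    asked = [1, 2, 3]
    if has_phon:
        asked.append(4)
        if check_alignment:
            # Identical-segment check and phonetic format selection.
            asked.extend([5, 6])
    return asked


if __name__ == "__main__":
    for answers in [(False, False), (True, False), (True, True)]:
        print(answers, questions_asked(*answers))
```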



2.5.4 Files Needed in Startup Directory



The following file must lie in the startup directory for the script to work properly if
the answer to Question 4 of Section 2.5.3 is “y”:












• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.5.5 Generated Output Files



If one or more errors are found in the TextGrids, the following file will be created in the

original startup directory before program execution is stopped:



• DirectoriesWithErrors.txt



Holds the names of all the directories in which TextGrid files with errors were found.

Note that this file will only be created if working in one or more directories.



In addition to this, if directories are found to contain errors, a FilesWithErrors (FWE)

subdirectory will be created in each affected directory that will hold the TextGrid files

containing the errors. Their alaw files (lying in the working directory or the alaw directory) will be

copied to the FWE subdirectory as well. A FilesWithErrors.txt file will also be created in the

FWE folder and will contain an error message for each affected TextGrid file. To

work through these error files, use WorkThroughFWEErrorFiles.pl (Section 2.35).





2.6 CheckForInvalidIntervals.pl



2.6.1 Overview



Script checks intervals occurring at the beginning and the end of the transcriptions

(therefore the first interval and last interval) to see if they are invalid. TextGrids may

contain either orthographic transcriptions, or orthographic and phonetic transcriptions.



2.6.2 Command Line Options



1) CheckForInvalidIntervals.pl



Checks intervals occurring at the beginning and the end of the transcriptions in the

TextGrid files that can be found in the directories which are specified in the list file

DirectoryList.lst.



2) CheckForInvalidIntervals.pl -f



Checks intervals occurring at the beginning and the end of the transcriptions in the

specified TextGrid file.



3) CheckForInvalidIntervals.pl -d



Checks intervals occurring at the beginning and the end of the transcriptions in the

specified directory's TextGrid files.



4) CheckForInvalidIntervals.pl -l



The newly specified list file will be used instead of DirectoryList.lst.
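
The same switches (-f, -d, -l, with DirectoryList.lst as the default) recur throughout this report. The following Python sketch shows one way such a switch set could be resolved into a list of targets. It is not the Perl implementation. It assumes that each switch takes a path argument, that the switches are mutually exclusive, and that a list file holds one directory per line; the function names are hypothetical.

```python
import argparse

DEFAULT_LIST_FILE = "DirectoryList.lst"


def parse_options(argv):
    """Parse the -f/-d/-l switches (assumed to be mutually exclusive)."""
    parser = argparse.ArgumentParser(description="Sketch of the -f/-d/-l switches")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-f", metavar="TEXTGRID", help="work on a single TextGrid file")
    group.add_argument("-d", metavar="DIRECTORY", help="work on a single directory")
    group.add_argument("-l", metavar="LISTFILE", help="use another list file")
    return parser.parse_args(argv)


def read_list_file(path):
    """Assumption: one directory per line; blank lines are ignored."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def resolve_targets(options):
    """Return ("file", [path]) or ("dirs", [directory, ...])."""
    if options.f:
        return "file", [options.f]
    if options.d:
        return "dirs", [options.d]
    return "dirs", read_list_file(options.l or DEFAULT_LIST_FILE)
```

With no switch at all, the sketch falls back to DirectoryList.lst in the current directory, which mirrors option 1 above.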








2.6.3 Generated Output Files



See Section 2.5.5 for the output files that will be generated if errors are located during

program execution.



If directories are found containing TextGrid files with possibly invalid intervals (an

interval that is flagged by the script is not necessarily invalid), the following files

will be created:



• In the startup directory: DirectoriesWithPossibleIntervalsToRemove.txt



Will hold the directory names that contain TextGrid files with possible intervals to

remove.



• In each affected TextGrid directory: FilesWithPossibleIntervalsToRemove.txt



Will hold the names of the TextGrid files in this directory that contain possible

intervals to remove.





2.7 CheckLexicon.pl



2.7.1 Overview



This script checks a specified NOVAR (see footnote 5) or VAR (see footnote 6) lexicon for errors. The lexicon's phones

must be in XSAMPA format. No command line arguments are necessary.



2.7.2 Questions



1) Please enter the name of the file containing the lexicon.



Here you must enter the name of the NOVAR or VAR lexicon that must be checked.



2.7.3 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.7.4 Generated Output Files



If errors were found in the lexicon, one or more of the following files will be created in the

startup directory depending on the type of errors that were found:









Footnote 5: A NOVAR lexicon contains only one phonetic sequence for each orthographic word.

Footnote 6: A VAR lexicon usually contains various phonetic sequences for each orthographic word. The phonetic sequences of

an orthographic word are separated with commas.






• GeneralLexiconErrors.txt



Will hold the general types of errors (this does not include phonetic symbol errors)

that were encountered on each line of the lexicon. An example of such a file is

shown below:



Line 6: White space encountered at the end of the phone sequence.

Line 17: "]" should be the last character on rhs of "=" sign.

Line 22: "[" should be the first character after the "=" sign.

Line 23: "=" should not occur at the beginning of the line.

Line 41: White space encountered at the beginning of the phone sequence.







• PhoneLexiconErrors.txt



Will hold the invalid phonetic symbols that were encountered on each line of the

lexicon. An example of such a file is shown below:



Line 7: Unknown phone(s): basdf (6)

Line 14: Unknown phone(s): t-\S@l (3) , Sl@ (14) , t-\Sl (25)

Line 19: Unknown phone(s): b{ (1)

Line 20: Unknown phone(s): @fO: (10)





Note that the phone’s position will always be given between parentheses.



• UniqueUnknownPhonesInLexicon.txt



Will hold the unique list of invalid phonetic symbols that were encountered in the

lexicon.



• DuplicatedWordsInLexicon_CSS.txt



Will hold those orthographic words that were found to occur more than once during

a case sensitive search (CSS) of the lexicon.



If no errors were found in the lexicon, the following file could possibly be created in the

startup directory:



• DuplicatedWordsInLexicon_CIS.txt



Will hold those orthographic words that were found to occur more than once during

a case insensitive search (CIS) of the lexicon. Possible errors could be found this

way. The user should therefore always inspect this file before starting to use the

lexicon.
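
The general format checks and the two duplicate searches can be sketched as follows. This is a Python illustration, not the script's Perl code. The entry layout word=[phone sequence] (with comma-separated sequences in a VAR lexicon) is an assumption inferred from the example error messages above, and the function names and the "missing = sign" message are hypothetical.

```python
from collections import Counter

# Assumed entry layout, inferred from the example messages:
#   word=[phone sequence]            (NOVAR)
#   word=[sequence 1,sequence 2]     (VAR)


def check_line(line):
    """Return general-format error messages for one lexicon line."""
    if line.startswith("="):
        return ['"=" should not occur at the beginning of the line.']
    _word, separator, rhs = line.partition("=")
    if not separator:
        return ['"=" sign is missing.']  # hypothetical message
    errors = []
    if not rhs.startswith("["):
        errors.append('"[" should be the first character after the "=" sign.')
    if not rhs.endswith("]"):
        errors.append('"]" should be the last character on rhs of "=" sign.')
    if errors:
        return errors
    for sequence in rhs[1:-1].split(","):
        if sequence != sequence.lstrip():
            errors.append("White space encountered at the beginning of the phone sequence.")
        if sequence != sequence.rstrip():
            errors.append("White space encountered at the end of the phone sequence.")
    return errors


def duplicated_words(words, case_sensitive=True):
    """Words occurring more than once (CSS), or more than once ignoring case (CIS)."""
    keys = words if case_sensitive else [word.lower() for word in words]
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)
```

A word pair such as "Bank" and "bank" passes the case sensitive search but is reported by the case insensitive one, which is why the CIS file is worth inspecting even when no errors were found.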












2.8 CheckNumberOfALawsAndTextGrids.pl



2.8.1 Overview



Call-subdirectories are checked to see whether they contain the same number of alaw and

orthographic TextGrid files.



2.8.2 Command Line Options



1) CheckNumberOfALawsAndTextGrids.pl



Checks to see whether the number of alaw files and the number of orthographic

TextGrid files that can be found in the directories (each containing call-

subdirectories with these data files) which are specified in the list file

DirectoryList.lst, are the same.



2) CheckNumberOfALawsAndTextGrids.pl -d



Checks to see whether the number of alaw files and the number of orthographic

TextGrid files that can be found in the specified directory (containing call-

subdirectories with these data files) are the same.



3) CheckNumberOfALawsAndTextGrids.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.8.3 Generated Output Files



If one or more errors (with respect to number of alaw and TextGrid files in call-

subdirectories) are found, the following files will be created:



• In the original startup directory: DirectoriesWithErrors.txt



Holds the names of all the base directories containing one or more call-

subdirectories in which the number of alaw and TextGrid files are not the same.



• In affected base directories (holds call-subdirectories): CallFoldersWithProblems.txt



Holds the list of call folders in which alaw/TextGrid problems have been

encountered.



• In the affected call folders - one or both of the following files:



MissingALawFiles.txt



Holds the names of the missing alaw files.



MissingTextGridFiles.txt



Holds the names of the missing TextGrid files.










2.9 Cleanup.pl



2.9.1 Overview



The transcriptions are forced to comply with the specifications with regards to space

characters, markers, spelled letters, empty intervals, and the use of */+/~/=. It can work

with TextGrid files containing only orthographic transcriptions, or orthographic and

phonetic transcriptions.



2.9.2 Command Line Options



1) Cleanup.pl



Will force the TextGrid files that can be found in the directories which are specified

in the list file DirectoryList.lst to comply with specifications.



2) Cleanup.pl -f



Will only force the specified TextGrid file to comply with specifications.



3) Cleanup.pl -d



Will only force the specified directory's TextGrid files to comply with specifications.



4) Cleanup.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.9.3 Questions



1) Remove "=" character from orthographic transcription (y/n)



Answer “y” to remove “=” characters from Afrikaans orthographic transcriptions.

Note that these characters are needed during the generation of the deterministic

phonetic transcriptions. Therefore, use this option wisely!



2.9.4 Generated Output Files



See Section 2.2.4.





2.10 CompareAlawAndTextGridFilenames.pl



2.10.1 Overview



Compares alaw and TextGrid filenames in order to make sure that each alaw file has a

TextGrid companion, and that each TextGrid file has an alaw companion. Note that a base

directory must be specified in a list file or on the command line. The base directory must

contain the alaw and TextGrid directories. The list file will be allowed to contain only one

entry.










2.10.2 Command Line Options



1) CompareAlawAndTextGridFilenames.pl



Base directory will be extracted from DirectoryList.lst.



2) CompareAlawAndTextGridFilenames.pl -d



Base directory will be extracted from command line.



3) CompareAlawAndTextGridFilenames.pl -l



Base directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.10.3 Generated Output Files



The following files could be created in the startup directory:



• TextGridsWithoutALawCompanions.txt



Will be created if TextGrids (in TextGrid directory) are found that don’t have alaw (in

alaw directory) companions.



• ALawsWithoutTextGridCompanions.txt



Will be created if alaws (in alaw directory) are found that don’t have TextGrid (in

TextGrid directory) companions.
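
The companion check amounts to two set differences over the base names. A minimal Python sketch follows; it is not the script's Perl code. The subdirectory names "alaw" and "TextGrid" and the extensions ".alaw" and ".TextGrid" are assumptions, as are the function names.

```python
import os


def base_names(directory, extension):
    """Base names (without extension) of the files in directory with that extension."""
    return {
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.lower().endswith(extension.lower())
    }


def find_unpaired(base_directory):
    """Return (TextGrids without alaw companions, alaws without TextGrid companions)."""
    # Assumed layout: <base>/alaw/*.alaw and <base>/TextGrid/*.TextGrid
    alaws = base_names(os.path.join(base_directory, "alaw"), ".alaw")
    grids = base_names(os.path.join(base_directory, "TextGrid"), ".TextGrid")
    return sorted(grids - alaws), sorted(alaws - grids)
```

The first list corresponds to the contents of TextGridsWithoutALawCompanions.txt and the second to ALawsWithoutTextGridCompanions.txt.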





2.11 Converter.pl



2.11.1 Overview



Converts phonetic transcriptions in TextGrid files from one format to another, e.g.

XSAMPA -> Praat or Praat -> XSAMPA. Note that the TextGrid files must contain

orthographic and phonetic transcriptions.



2.11.2 Command Line Options



1) Converter.pl



Converts the phonetic transcription format of the TextGrid files that can be found in

the directories which are specified in the list file DirectoryList.lst.



2) Converter.pl -f



Will only convert the phonetic transcription format of the specified TextGrid file.












3) Converter.pl -d



Will only convert the phonetic transcription format of the specified directory's

TextGrid files.



4) Converter.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.11.3 Questions



1) For this script to work properly, the data must be free of all CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of ALL CFE (CheckForErrors.pl)

errors (see footnote 7). Otherwise, answer “n” in order to abort the program.



2) Convert

1. XSAMPA to Praat

2. Praat to XSAMPA



Simply choose the conversion process that must be performed.



2.11.4 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.11.5 Generated Output Files



See Section 2.2.4.





2.12 CopyAssimErrorFilesToFWE.pl



2.12.1 Overview



TextGrids with possible assimilation errors and their alaws are copied to FilesWithErrors

(FWE) subdirectory under the TextGrid directory. Note that a base directory must be

specified in a list file or on the command line. The base directory must contain the alaw

and TextGrid directories. The list file will be allowed to contain only one entry.









Footnote 7: All CFE errors include a) utterance and sentence marker location errors, b) orthographic and phonetic alignment

errors, and c) phonetic symbol errors.






2.12.2 Command Line Options



1) CopyAssimErrorFilesToFWE.pl



Base directory will be extracted from DirectoryList.lst.



2) CopyAssimErrorFilesToFWE.pl -d



Base directory will be extracted from command line.



3) CopyAssimErrorFilesToFWE.pl -l



Base directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.12.3 Questions



1) Please enter name of file holding assimilation information.



The name of the file holding the possible assimilation errors must be entered on the

command line so that the script can determine which TextGrids (and

corresponding alaws) to copy to the FilesWithErrors subdirectory under the TextGrid

folder. To work through these error files, use

WorkThroughFWEErrorFiles.pl (Section 2.35). Remember to copy the

assimilation error file to the FWE subdirectory and to rename it to FilesWithErrors.txt;

otherwise WorkThroughFWEErrorFiles.pl will not work properly. An example of a file

holding the assimilation information follows below:



AE001LM002.TextGrid:

0 Van;_Der_Linde [/sta]

0 fan@rl@nd@ [/sta]



AE001LM023.TextGrid:

(s) Helen's;stint as

(s) h{l@nst@nt {z



AE001LM023.TextGrid:

(s) but;the;elementary practical

(s) baDEl@mEntr\i pr\{ktik@l



AE001LM033.TextGrid:

one eight;two three

wan @-\itu Tri



AE001LM039.TextGrid:

prison was;surrounded by

pr\@z@n wQs@r\a-\und@d ba-\i
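
Working out which TextGrids to copy from a file in this layout only requires the header lines. The Python sketch below is an illustration and not the script's Perl code. It assumes that every entry starts with a line ending in ".TextGrid:", as in the example above; the function name is hypothetical.

```python
def textgrids_in_assimilation_file(lines):
    """Return the TextGrid names heading each entry, in order, without duplicates."""
    names = []
    for line in lines:
        line = line.strip()
        # Assumption: entry headers look like "AE001LM002.TextGrid:".
        if line.endswith(".TextGrid:"):
            name = line[:-1]  # drop the trailing colon
            if name not in names:
                names.append(name)
    return names
```

A TextGrid that appears in several entries, such as AE001LM023.TextGrid above, is returned once, so it would be copied only once.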












2.13 CopyPhoneCorrectedFiles.pl



2.13.1 Overview



Copies phonetically corrected data that's scattered over several base directories to a

single target base directory. Note that each base directory must contain alaw, attrib, lab,

and TextGrid (merge) directories.



2.13.2 Command Line Options



1) CopyPhoneCorrectedFiles.pl



Copies phonetically corrected data that can be found in the base directories which

are specified in the list file DirectoryList.lst to a single target base directory.



2) CopyPhoneCorrectedFiles.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.13.3 Questions



1) Work in merge_XSAMPA (y/n)



Answer “y” to this question if the TextGrid files are lying in merge_XSAMPA

directories. Otherwise, answer “n” if they are lying in merge directories.



2) Please enter name of directory that will hold the data.



Enter the name of the target base directory, which must already exist. The following

directories will be created in it to hold the data:

alaw, attrib, lab, and merge.





2.14 CopySNBDErrorFilesToFWE.pl



2.14.1 Overview



TextGrids with SNBD (sil/nonsil/boundary/duration) errors and their alaws are copied to

FilesWithErrors (FWE) subdirectory under the TextGrid directory. Note that a base

directory must be specified in a list file or on the command line. The base directory must

contain the alaw and TextGrid (merge) directories. The list file will be allowed to contain

only one entry. After copying the TextGrid and alaw files to FWE subdirectory a

FilesWithErrors.txt file will be created containing information about the SNBD errors. You

can work through the TextGrid files containing the errors using

WorkThroughFWEErrorFiles.pl (Section 2.35).



2.14.2 Command Line Options



1) CopySNBDErrorFilesToFWE.pl



Base directory will be extracted from DirectoryList.lst.








2) CopySNBDErrorFilesToFWE.pl -d



Base directory will be extracted from command line.



3) CopySNBDErrorFilesToFWE.pl -l



Base directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.14.3 Questions



1) Are there any sil errors (y/n)



Answer “y” if TextGrid files with sil errors (see footnote 8) exist. To move on to Question 3 answer

“n”.



2) Please enter the name of the sil file.



The name of the file holding the sil errors must be entered on the command line in

order to allow the script to determine which TextGrids – and corresponding alaws –

containing this specific type of error to copy to FilesWithErrors subdirectory under

the TextGrid folder. An example of this type of error file is shown below:



AE016CF021_006

AE018CF031_003

AE058LF008_001

AE062LM002_004

AE067LF009_010

AE088LF038_005

AE106LF017_008

AE112CM010_008

AE115CM022_002

AE116CF022_001

AE116CF040_007

AE130LF004_003

AE130LF024_001

AE130LF040_009

AE144LM003_001

AE155LM038_022

AE162LM003_004





The TextGrid base-names and interval numbers (containing the sil errors) are

separated with underscores. Edward wrote a program producing this type of error

file. Further enquiries regarding the generation of these files should therefore be

directed towards him.



Note that this question will only be asked if the answer to Question 1 is “y”.









Footnote 8: A sil error is when an interval is marked as [sil], but should rather have been for example [int] or [sta].






3) Are there any nonsil errors (y/n)



Answer “y” if TextGrid files with nonsil errors (see footnote 9) exist. To move on to Question 5

answer “n”.



4) Please enter the name of the nonsil file.



The name of the file holding the nonsil errors must be entered on the command line

in order to allow the script to determine which TextGrids – and corresponding alaws

– containing this specific type of error to copy to FilesWithErrors subdirectory under

the TextGrid folder. The format of this file is the same as in the case of the sil errors

– see Question 2.



Note that this question will only be asked if the answer to Question 3 is “y”.



5) Are there any boundary errors (y/n)



Answer “y” if TextGrid files with boundary errors (see footnote 10) exist. To move on to Question 7

answer “n”.



6) Please enter the name of the boundary file.



The name of the file holding the boundary errors must be entered on the command

line in order to allow the script to determine which TextGrids – and corresponding

alaws – containing this specific type of error to copy to FilesWithErrors subdirectory

under the TextGrid folder. An example of this type of error file is shown below:



AE006LF019_002 right

AE006LF039_003 right

AE017LF002_003 left

AE017LF003_002 right

AE017LF006_002 left

AE017LF010_002 left

EE004LF029_003 right

AE020LM001_003 right

AE020LM013_002 right

AE023LM009_002 right

AE023LM024_003 right

AE024LF007_003 right

AE024LF030_003 right

AE024LF033_002 right

AE024LF037_002 right

AE024LF038_002 right

AE024LF040_003 right





The TextGrid base-names and interval numbers (containing the boundary errors)

are separated with underscores. In addition to this, an indication will be given after

an interval number as to whether the error occurred on the left or right boundary.

Edward wrote a program producing this type of error file. Further enquiries

regarding the generation of these files should therefore be directed towards him.


Footnote 9: A nonsil error is when an interval is marked as e.g. [int] or [sta], but should rather be marked as [sil].

Footnote 10: A boundary error is when a boundary is not placed at the correct location within the transcription.








Note that this question will only be asked if the answer to Question 5 is “y”.



7) Are there any duration errors (y/n)



Answer “y” if TextGrid files with duration errors (see footnote 11) exist. To move on answer “n”.



8) Please enter the name of the duration file.



The name of the file holding the duration errors must be entered on the command

line in order to allow the script to determine which TextGrids – and corresponding

alaws – containing this specific type of error to copy to FilesWithErrors subdirectory

under the TextGrid folder. The format of this file is the same as in the case of the sil

errors – see Question 2.



Note that this question will only be asked if the answer to Question 7 is “y”.
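
The sil, nonsil, boundary and duration files share one line layout, with an optional "left" or "right" field in the boundary file. The Python sketch below shows how such a line could be split; it is not the script's Perl code, and the function names are hypothetical. It assumes that the interval number follows the last underscore of the first field.

```python
def parse_snbd_line(line):
    """Split one error-file line into (base_name, interval_number, side).

    side is "left" or "right" for boundary files and None otherwise.
    """
    fields = line.split()
    # Assumption: the interval number follows the last underscore.
    base_name, _, interval = fields[0].rpartition("_")
    side = fields[1] if len(fields) > 1 else None
    return base_name, int(interval), side


def textgrids_to_copy(lines):
    """Unique TextGrid base names mentioned in an error file."""
    return sorted({parse_snbd_line(line)[0] for line in lines if line.strip()})
```

For the line "AE006LF019_002 right" the sketch yields the base name AE006LF019, interval 2 and the side "right".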





2.15 DeleteRedundantALTFilesLookingAtAttribs.pl



2.15.1 Overview



ALTs – Alaws (alaw folder), Lab files (lab folder – optional) and TextGrids (merge folder) –

that do not have corresponding attribute files (attrib folder) will be removed. Note that a

base directory must be specified in a list file or on the command line. The base directory

must contain the alaw, attrib, lab (optional) and merge directories. The list file will be

allowed to contain only one entry.



2.15.2 Command Line Options



1) DeleteRedundantALTFilesLookingAtAttribs.pl



Base directory will be extracted from DirectoryList.lst.



2) DeleteRedundantALTFilesLookingAtAttribs.pl -d



Base directory will be extracted from command line.



3) DeleteRedundantALTFilesLookingAtAttribs.pl -l



Base directory will be extracted from the newly specified list file instead of

DirectoryList.lst.









Footnote 11: A duration error is when an interval is either too long or too short.






2.16 ExtractTranscriptions.pl



2.16.1 Overview



Orthographic and/or phonetic transcriptions are extracted from TextGrid files. TextGrids

are allowed to contain only orthographic transcriptions, or orthographic and phonetic

transcriptions.



2.16.2 Command Line Options



1) ExtractTranscriptions.pl



Extracts orthographic and/or phonetic transcriptions from TextGrid files that can be

found in the directories which are specified in the list file DirectoryList.lst.



2) ExtractTranscriptions.pl -f



Will only extract orthographic and/or phonetic transcriptions from the specified

TextGrid file.



3) ExtractTranscriptions.pl -d



Will only extract orthographic and/or phonetic transcriptions from the specified

directory's TextGrid files.



4) ExtractTranscriptions.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.16.3 Questions



1) Does TextGrids contain ort and phon transcriptions (y/n)



Answer “y” if the TextGrids contain both orthographic and phonetic transcriptions.

Otherwise, if the TextGrids contain only orthographic transcriptions answer “n”.



2) Extract

1. Ort

2. Phon

3. Ort & Phon



Select which transcriptions to extract.



Note that this question will only be asked if the answer to Question 1 is “y”.



2.16.4 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files.



If working with one or more directories, the extracted transcriptions will be stored in the

following file in each TextGrid directory:








• Transcriptions.txt



This file will hold a summary of each TextGrid’s transcriptions. Each TextGrid

directory will have such a file. As an example suppose we have a number of

TextGrids in a directory containing orthographic and phonetic transcriptions. If the

script extracts only the orthographic transcriptions, the text file will contain the

following (each interval is always enclosed in " characters):



AE001LM001.TextGrid:

"0" "[spk]" "(u) A. 0 E. zero zero one (/u)" "[spk]" "[sta]" "[ext]"



AE001LM002.TextGrid:

"(u) [sta] Martin 0 Isak 0 Van;_Der_Linde [/sta] (/u)" "[spk]" "[sta]"



AE001LM003.TextGrid:

"(u) [sta] V. 0 A. E. N. [/sta] (/u)" "[spk]" "(u) [sta] D. 0 E. R. 0 L. 0 I. [/sta] (/u)"



AE001LM004.TextGrid:

"[int]" "[sta]"



AE001LM005.TextGrid:

"[spk]" "(u) nineteen (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"







However, if the script extracts only the phonetic transcriptions the following will be

written to the file:



AE001LM001.TextGrid:

"0" "[spk]" "(u) ?@-\i 0 i zI-\@r@-\u zI-\@r@-\u wan (/u)" "[spk]" "[sta]" "[ext]"



AE001LM002.TextGrid:

"(u) [sta] ma:rt@n 0 isak 0 fan@rl@nd@ [/sta] (/u)" "[spk]" "[sta]"



AE001LM003.TextGrid:

"(u) [sta] vi 0 @-\i i En [/sta] (/u)" "[spk]" "(u) [sta] di 0 i a:r\ 0 {l 0 a-\i [/sta] (/u)"



AE001LM004.TextGrid:

"[int]" "[sta]"



AE001LM005.TextGrid:

"[spk]" "(u) na-\intin (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"





Finally, if both orthographic and phonetic transcriptions are extracted, the

Transcriptions.txt will look as follows:












AE001LM001.TextGrid:

"0" "[spk]" "(u) A. 0 E. zero zero one (/u)" "[spk]" "[sta]" "[ext]"

"0" "[spk]" "(u) ?@-\i 0 i zI-\@r@-\u zI-\@r@-\u wan (/u)" "[spk]" "[sta]" "[ext]"



AE001LM002.TextGrid:

"(u) [sta] Martin 0 Isak 0 Van;_Der_Linde [/sta] (/u)" "[spk]" "[sta]"

"(u) [sta] ma:rt@n 0 isak 0 fan@rl@nd@ [/sta] (/u)" "[spk]" "[sta]"



AE001LM003.TextGrid:

"(u) [sta] V. 0 A. E. N. [/sta] (/u)" "[spk]" "(u) [sta] D. 0 E. R. 0 L. 0 I. [/sta] (/u)"

"(u) [sta] vi 0 @-\i i En [/sta] (/u)" "[spk]" "(u) [sta] di 0 i a:r\ 0 {l 0 a-\i [/sta] (/u)"



AE001LM004.TextGrid:

"[int]" "[sta]"

"[int]" "[sta]"



AE001LM005.TextGrid:

"[spk]" "(u) nineteen (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"

"[spk]" "(u) na-\intin (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"
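
A Transcriptions.txt file in this layout is easy to read back. The Python sketch below is an illustration, not part of the AST scripts. It assumes that each entry starts with a line ending in ".TextGrid:", that every following non-empty line is one tier, and that the interval text itself never contains a " character; the function name is hypothetical.

```python
import re

# One interval is everything between a pair of double quotes.
INTERVAL = re.compile(r'"([^"]*)"')


def parse_transcriptions(text):
    """Map each TextGrid name to a list of tiers, each a list of interval texts."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(".TextGrid:"):
            current = line[:-1]  # drop the trailing colon
            result[current] = []
        elif line and current is not None:
            result[current].append(INTERVAL.findall(line))
    return result
```

When both transcriptions were extracted, each entry holds two tiers (orthographic first, then phonetic); otherwise it holds one.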









2.17 GenerateLabFiles.pl



2.17.1 Overview



Generates lab files for AST engineers using TextGrid files containing orthographic and

phonetic transcriptions. Only the phonetic transcriptions will be listed in the lab files.



2.17.2 Command Line Options



1) GenerateLabFiles.pl



Generates lab files for the TextGrid files that can be found in the directories which

are specified in the list file DirectoryList.lst.



2) GenerateLabFiles.pl -f



Generates a lab file for the specified TextGrid file.



3) GenerateLabFiles.pl -d



Generates lab files for the TextGrid files that can be found in the specified directory.



4) GenerateLabFiles.pl -l



The newly specified list file will be used instead of DirectoryList.lst.












2.17.3 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.17.4 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files while this script is running.



For each TextGrid file a lab file will be generated. The lab files will be created in either the

TextGrid directory (if lab directory does not exist), or a lab directory (if it exists). As an

example of how a lab file looks, first consider the following TextGrid file:



File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 5.6879999999999997
tiers?
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "Orthographic"
        xmin = 0
        xmax = 5.6879999999999997
        intervals: size = 5
        intervals [1]:
            xmin = 0
            xmax = 0.43666928678720285
            text = "0"
        intervals [2]:
            xmin = 0.43666928678720285
            xmax = 0.7731976451954351
            text = "[spk]"
        intervals [3]:
            xmin = 0.7731976451954351
            xmax = 2.4472260395268113
            text = "(s) i am going to speak english (/s)"
        intervals [4]:
            xmin = 2.4472260395268113
            xmax = 2.944283528471245
            text = "[spk]"
        intervals [5]:
            xmin = 2.944283528471245
            xmax = 5.6879999999999997
            text = "[sta]"
    item [2]:
        class = "IntervalTier"
        name = "phon1"
        xmin = 0
        xmax = 5.6879999999999997
        intervals: size = 5
        intervals [1]:
            xmin = 0
            xmax = 0.43666928678720285
            text = "0"
        intervals [2]:
            xmin = 0.43666928678720285
            xmax = 0.7731976451954351
            text = "[spk]"
        intervals [3]:
            xmin = 0.7731976451954351
            xmax = 2.4472260395268113
            text = "(s) V-\I {m g@-\UwIN tu: spi:k INglIS (/s)"
        intervals [4]:
            xmin = 2.4472260395268113
            xmax = 2.944283528471245
            text = "[spk]"
        intervals [5]:
            xmin = 2.944283528471245
            xmax = 5.6879999999999997
            text = "[sta]"



Its generated lab file will look as follows:



BHEAD
EHEAD

End_Time  Phon1   Boundary_Type  Category
0.43667   [sil]   Manual         sil
0.77320   [spk]   Manual         spk
0.85690   V-\I    EquiDiv        phon
0.94060   {       EquiDiv        phon
1.02430   m       EquiDiv        phon
1.10800   g       EquiDiv        phon
1.19170   @-\U    EquiDiv        phon
1.27541   w       EquiDiv        phon
1.35911   I       EquiDiv        phon
1.44281   N       EquiDiv        phon
1.52651   t       EquiDiv        phon
1.61021   u:      EquiDiv        phon
1.69391   s       EquiDiv        phon
1.77761   p       EquiDiv        phon
1.86132   i:      EquiDiv        phon
1.94502   k       EquiDiv        phon
2.02872   I       EquiDiv        phon
2.11242   N       EquiDiv        phon
2.19612   g       EquiDiv        phon
2.27982   l       EquiDiv        phon
2.36352   I       EquiDiv        phon
2.44723   S       Manual         phon
2.94428   [spk]   Manual         spk
5.68800   [sta]   Manual         sta





Note that the word “Manual” in the “Boundary_Type” column marks the interval boundaries that were

put in by hand. The “EquiDiv” boundaries are calculated by

the script and are NOT accurate. These values are obtained by simply dividing an interval’s

length by the number of items in it. The “Category” column simply states the type of event

that is taking place during that time slice.
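
The “EquiDiv” calculation described above can be written out as a short Python sketch. It is not the script's Perl code. It assumes that the phonetic string has already been split into individual phones (which in practice needs the phone set) and that the (s) and (/s) markers have been removed; the function name is hypothetical.

```python
def equidiv_boundaries(start, end, phones):
    """End time of each phone when the interval is divided into equal parts.

    start and end are the manually placed boundaries of the interval.
    """
    step = (end - start) / len(phones)
    return [(phone, start + step * (index + 1))
            for index, phone in enumerate(phones)]


# Interval 3 of the example TextGrid, split into its 20 phones by hand.
PHONES = ["V-\\I", "{", "m", "g", "@-\\U", "w", "I", "N", "t", "u:",
          "s", "p", "i:", "k", "I", "N", "g", "l", "I", "S"]

for phone, end_time in equidiv_boundaries(0.7731976451954351,
                                          2.4472260395268113, PHONES):
    print("%.5f %s" % (end_time, phone))
```

Rounded to five decimals, the first value is 0.85690 and the last is 2.44723, in agreement with the lab file shown above. Only the final boundary of the interval coincides with a boundary that was placed by hand.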



Note that this script should be updated in order for the lab files to show more information,

e.g. orthographic information.





2.18 GenOrt.pl



2.18.1 Overview



Generates orthographic TextGrids from TextGrid files containing orthographic and

phonetic transcriptions. TextGrid files will only be allowed to contain orthographic and

phonetic transcriptions. WARNING: The original TextGrid files will be replaced with the

newly generated orthographic TextGrids.



2.18.2 Command Line Options



1) GenOrt.pl



Will generate orthographic TextGrids from the merged files that can be found in the

directories which are specified in the list file DirectoryList.lst.



2) GenOrt.pl -f



Will only generate an orthographic TextGrid from the specified merged file.



3) GenOrt.pl -d



Will only generate orthographic TextGrids from the specified directory's merged

files.



4) GenOrt.pl -l



The newly specified list file will be used instead of DirectoryList.lst.






2.18.3 Questions



1) WARNING: This script will replace merged files with orthographic TextGrids.

Do you want to continue? (y/n)



Answer “y” if you want to go ahead with generating the orthographic TextGrids from

the merged TextGrids. Otherwise, answer “n” to abort the program.



2.18.4 Generated Output Files



See Section 2.2.4.





2.19 GetPronouncedAcronyms.pl



2.19.1 Overview



Extracts pronounced acronyms from TextGrid files. The files are allowed to contain only

orthographic transcriptions, or orthographic and phonetic transcriptions.



2.19.2 Command Line Options



1) GetPronouncedAcronyms.pl



Extracts pronounced acronyms from the TextGrid files that can be found in the

directories which are specified in the list file DirectoryList.lst.



2) GetPronouncedAcronyms.pl -f



Extracts pronounced acronyms from the specified TextGrid file.



3) GetPronouncedAcronyms.pl -d



Extracts pronounced acronyms from the specified directory's TextGrid files.



4) GetPronouncedAcronyms.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.19.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)

errors (see footnote 12). Otherwise, answer “n” in order to abort the program.









[12] Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.






2.19.4 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files during the process of running this script.



If pronounced acronyms are found, the following file will be created in the startup directory:



• PronouncedAcronyms.txt



The pronounced acronyms will simply be listed in this file. It will be created whether

working with one or more directories or a single TextGrid (-f switch).





2.20 GetTextGridDirectoriesRecursively.pl



2.20.1 Overview



Recursively looks for directories containing TextGrid files.



2.20.2 Command Line Options



1) GetTextGridDirectoriesRecursively.pl



Will recursively look for folders containing TextGrid files in the directories listed in

DirectoryList.lst.



2) GetTextGridDirectoriesRecursively.pl -d



Will recursively look for folders containing TextGrid files in the specified directory.



3) GetTextGridDirectoriesRecursively.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.20.3 Generated Output Files



If directories were found containing TextGrid files the following file will be created in the

startup directory:



• Directories.txt



The directories that contain TextGrid files will simply be listed in this file.
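The recursive search described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl implementation: the function names are hypothetical, the match on the file extension is assumed to be case insensitive, and the output is assumed to hold one directory per line.

```python
import os


def find_textgrid_dirs(root):
    """Return every directory at or below root that holds at least one .TextGrid file."""
    found = []
    for current, _subdirs, files in os.walk(root):
        if any(name.lower().endswith(".textgrid") for name in files):
            found.append(current)
    return sorted(found)


def write_directory_list(roots, out_path="Directories.txt"):
    """Write the directories found under all roots to out_path, one per line.

    The file is only created when at least one directory was found,
    which follows the behaviour documented above.
    """
    dirs = [d for root in roots for d in find_textgrid_dirs(root)]
    if dirs:
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(dirs) + "\n")
    return dirs
```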





2.21 Lin2Dos.pl



2.21.1 Overview



Converts a text file from Linux to DOS/Win format. Note that the original text file will be

overwritten.










2.21.2 Command Line Options



• Lin2Dos.pl
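The conversion itself amounts to replacing Linux line endings (LF) with DOS/Win line endings (CR LF) and writing the result back over the original file. A minimal Python sketch is shown below. The function name is hypothetical, and the normalisation step that protects files already in DOS format is an assumption; this report does not say whether Lin2Dos.pl guards against a second conversion.

```python
def lin_to_dos(path):
    """Rewrite the file at path in place so that every line ends in CR LF."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Normalise first so that a file already in DOS format is not double-converted.
    converted = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    with open(path, "wb") as handle:
        handle.write(converted)
```

The file is handled in binary mode so that no other bytes are altered during the conversion.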





2.22 MoveDuplicateAttribs.pl



2.22.1 Overview



Moves all duplicate attribute files from the attribute directory to a specified directory. Note that

the attribute directory must be specified in a list file or on the command line. The list file will

be allowed to contain only one entry. Also, duplicate attribute files will be removed from the

attribute directory during the process of moving them.



2.22.2 Command Line Options



1) MoveDuplicateAttribs.pl



Attribute directory will be extracted from DirectoryList.lst.



2) MoveDuplicateAttribs.pl -d



Attribute directory will be extracted from command line.



3) MoveDuplicateAttribs.pl -l



Attribute directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.22.3 Questions



1) Please enter name of the destination folder.



Specify the destination folder’s name to which the identified duplicate attribute files

must be moved.



2) Remove files that exist in this directory (y/n)



If this destination directory contains files that should be removed answer “y”.

Otherwise, to specify a different directory answer “n”.





2.23 MovePhoneticTextGrids.pl



2.23.1 Overview



Phonetic transcription files generated using Patana are moved from orthographic

subdirectories (ort) to phonetic subdirectories (phon).












2.23.2 Command Line Options



1) MovePhoneticTextGrids.pl



Generated phonetic transcriptions that can be found in the directories (each

containing subdirectories ort and phon) which are specified in the list file

DirectoryList.lst are moved from orthographic subdirectories (ort) to phonetic

subdirectories (phon).



2) MovePhoneticTextGrids.pl -d



Generated phonetic transcriptions that can be found in the specified directory

(containing subdirectories ort and phon) are moved from orthographic subdirectory

(ort) to phonetic subdirectory (phon).



3) MovePhoneticTextGrids.pl -l



The newly specified list file will be used instead of DirectoryList.lst.





2.24 NumberOfMalesAndFemales.pl



2.24.1 Overview



Determines how many male and female callers there are for a batch of data. This

information will be extracted from the attribute filenames lying under the attrib directories.



2.24.2 Command Line Options



1) NumberOfMalesAndFemales.pl



Attribute directory names will be extracted from DirectoryList.lst.



2) NumberOfMalesAndFemales.pl -d



Attribute directory names will be extracted from command line.



3) NumberOfMalesAndFemales.pl -l



Attribute directory names will be extracted from the newly specified list file instead

of DirectoryList.lst.
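Counting callers by gender from filenames can be sketched as follows. The filename pattern used here is an assumption inferred from example names that appear elsewhere in this report (such as EE002LM002 and EE004LF014): two letters, a three-digit caller number, one further letter, and then M or F. The actual attribute filename specification is not given in this section, so the pattern, the function name and the rule of counting each caller number once are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: language code, caller number, one letter, then gender (M or F).
NAME_PATTERN = re.compile(r"^([A-Z]{2}\d{3})[A-Z]([MF])")


def count_genders(filenames):
    """Count distinct callers per gender from a list of filenames.

    Names that do not match the assumed pattern are ignored.
    """
    gender_by_caller = {}
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            gender_by_caller[match.group(1)] = match.group(2)
    return Counter(gender_by_caller.values())
```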





2.25 ProcessRawTranscriptionData.pl



2.25.1 Overview



Processes raw transcription data lying in call folders - renames alaw, attribute and

TextGrid files and then moves them to the appropriate directories. The call folders will be

replaced by the following directories: alaw/attrib/lab/merge/ort/phon. The only directories of

these that will contain any files after the process has been completed will be alaw, attrib

and ort.
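The directory layout described above can be sketched in Python as follows. The subdirectory names come from this section. The mapping of file kinds to destinations is inferred from the statement that only alaw, attrib and ort contain files afterwards, and the function names are hypothetical.

```python
import os

SUBDIRECTORIES = ("alaw", "attrib", "lab", "merge", "ort", "phon")  # names from the overview

# Inferred destinations: raw TextGrids are orthographic, so they are assumed to go to ort.
DESTINATION = {"alaw": "alaw", "attribute": "attrib", "textgrid": "ort"}


def create_layout(base_dir):
    """Create the six subdirectories under base_dir and return their paths."""
    paths = [os.path.join(base_dir, name) for name in SUBDIRECTORIES]
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return paths


def destination_for(kind, base_dir):
    """Return the directory that a file of the given kind would be moved to."""
    return os.path.join(base_dir, DESTINATION[kind])
```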






2.25.2 Command Line Options



1) ProcessRawTranscriptionData.pl



Processes the raw data that can be found in the directories (the base directories to

call folders) which are specified in the list file DirectoryList.lst.



2) ProcessRawTranscriptionData.pl -d



Will only process the specified directory's (the base directory to call folders) raw

data.



3) ProcessRawTranscriptionData.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.25.3 Questions



1) What value must M have (A/E/I/B/C/X/S/Z allowed).



This value of M will be used for error checking purposes. As can be seen, only A, E,

I, B, C, X, S, and Z will be allowed.



2) What value must L have (A/E/X/S/Z allowed).



This value of L will be used for error checking purposes. As can be seen, only A, E,

X, S, and Z will be allowed.



2.25.4 Generated Output Files



If any of the attribute files were found to contain errors, the following file will be created in

the startup directory before aborting the program:



• AttribErrors.txt



Will list all the errors that occurred in the attribute files as seen over all the attribute

directories it worked in.



2.26 RemoveDuplicateAttrib.pl



2.26.1 Overview



Removes duplicate attribute files that have been identified in earlier sessions from attribute

directory by looking at the call numbers contained within each attribute file. Note that the

attribute directory must be specified in a list file or on the command line. The list file will be

allowed to contain only one entry. The script will ask for the name of the directory holding

the duplicate attribute files.












2.26.2 Command Line Options



1) RemoveDuplicateAttribs.pl



Attribute directory will be extracted from DirectoryList.lst.



2) RemoveDuplicateAttribs.pl -d



Attribute directory will be extracted from command line.



3) RemoveDuplicateAttribs.pl -l



Attribute directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.26.3 Questions



1) Please enter name of folder holding duplicate attributes.



Specify where duplicate attribute files can be found that must be removed from

attribute directory. Note that the deletion process looks at the call numbers within

the attribute files in order to figure out which files to remove. The attribute filenames

therefore do not play a part in the deletion process.





2.27 RemoveInvalidIntervals.pl



2.27.1 Overview



Removes those interval markers that were wrongly put into the transcriptions at the

beginning and the end of the speech (therefore the first interval and last interval) in Praat.

TextGrids may contain either orthographic transcriptions, or orthographic and phonetic

transcriptions. Warning: Avoid running this script more than once on the data.



2.27.2 Command Line Options



1) RemoveInvalidIntervals.pl



Removes invalid intervals from the TextGrid files that can be found in the directories

which are specified in the list file DirectoryList.lst.



2) RemoveInvalidIntervals.pl -f



Removes invalid intervals from the specified TextGrid file.



3) RemoveInvalidIntervals.pl -d



Removes invalid intervals from the specified directory's TextGrid files.



4) RemoveInvalidIntervals.pl -l



The newly specified list file will be used instead of DirectoryList.lst.








2.27.3 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files during the process of running this script.



If working in one or more directories the following files could be created in the startup

directory:



• DirectoriesWithUpdates.txt



Those directories in which invalid intervals were removed from TextGrid files will be

listed in this file.



• DirectoriesWithPossibleIntervalsToRemove.txt



Will list those directories that were found to contain TextGrid files with intervals that

should possibly be removed – these are intervals that the script was unsure about

and as a result did not remove.



Those directories containing TextGrid files with possible intervals to remove will each

contain the following file:



• FilesWithPossibleIntervalsToRemove.txt



Will hold the names of the TextGrid files in this directory that contain possible

intervals to remove.





2.28 RemoveTextGridFilesWithErrors.pl



2.28.1 Overview



Removes TextGrid files with errors from TextGrid directory. Note that the names of the

TextGrid files containing the errors will be obtained from the FilesWithErrors subdirectory

lying under the TextGrid folder. Use this script wisely!



2.28.2 Command Line Options



1) RemoveTextGridFilesWithErrors.pl



Will remove TextGrid files containing errors from the specified TextGrid directory.



2.28.3 Questions



1) WARNING: This script will removed TextGrid files containing errors.

Do you want to continue? (y/n)



To remove the TextGrids containing errors answer “y”. Otherwise, to abort the

program answer “n”.










2.29 Renamer.pl



2.29.1 Overview



This script fixes those filenames that are not according to spec with respect to case and

counter. Note that each base folder should contain alaw, attrib, lab, and either merge,

merge_Praat or merge_XSAMPA directories.



2.29.2 Command Line Options



1) Renamer.pl



Will fix those filenames which are located in the folders that can be found in the

directories which are specified in the list file DirectoryList.lst.



2) Renamer.pl -d



Will fix those filenames which are located in the folders that can be found in the

specified directory.



3) Renamer.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.29.3 Questions



1) In which directory does the TextGrid files occur:

1. merge

2. merge_Praat

3. merge_XSAMPA



Specify in which directories the TextGrid files occur. Note that the directory names

MUST correspond to the option that you choose.





2.30 ReplaceAssimilations.pl



2.30.1 Overview



Replaces assimilations (#1..#2..#3). It has the ability to replace assimilations with the text

standing between #1..#2 or #2..#3, or it can replace them with something that is specified by

the user. TextGrid files are allowed to contain only orthographic transcriptions, or

orthographic and phonetic transcriptions.
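The first two replacement modes can be illustrated with a short Python sketch. It assumes that an assimilation is written with the literal markers #1, #2 and #3 on a single line, with one alternative between #1 and #2 and the other between #2 and #3. The exact notation is defined elsewhere, so the pattern and the function name are assumptions. The sketch also replaces every assimilation it finds, whereas the script only replaces those listed in a user-supplied file.

```python
import re

# Assumed notation: the literal markers #1, #2 and #3 delimit the two alternatives.
ASSIMILATION = re.compile(r"#1(.*?)#2(.*?)#3")


def replace_assimilations(text, keep):
    """Replace each #1..#2..#3 span with its first or second part.

    keep=1 keeps the text between #1 and #2; keep=2 keeps the text between #2 and #3.
    """
    if keep not in (1, 2):
        raise ValueError("keep must be 1 or 2")
    return ASSIMILATION.sub(lambda m: m.group(keep).strip(), text)
```

The non-greedy groups keep each match confined to a single assimilation when a line contains more than one.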



2.30.2 Command Line Options



1) ReplaceAssimilations.pl



Will replace assimilations in the TextGrid files that can be found in the directories

which are specified in the list file DirectoryList.lst.










2) ReplaceAssimilations.pl -f



Will replace assimilations in the specified TextGrid file.



3) ReplaceAssimilations.pl -d



Will replace assimilations in the specified directory's TextGrid files.



4) ReplaceAssimilations.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.30.3 Questions



1) Replace assimilations with contents between #1 and #2 (y/n)



Answer “y” if assimilations exist that must be replaced with contents between

#1..#2. To move on to Question 3 answer “n”.



2) Please enter name of the file holding these assimilations.



Enter name of file that will hold the list of assimilations which must be replaced with

the text between #1..#2.



Note that this question will only be asked if the answer to Question 1 is “y”.



3) Replace assimilations with contents between #2 and #3 (y/n)



Answer “y” if assimilations exist that must be replaced with contents between

#2..#3. To move on to Question 5 answer “n”.



4) Please enter name of the file holding these assimilations.



Enter name of file that will hold the list of assimilations which must be replaced with

the text between #2..#3.



Note that this question will only be asked if the answer to Question 3 is “y”.



5) Replace assimilations with own suggestions (y/n)



Answer “y” if assimilations exist that must be replaced with your own suggestions.

To move on answer “n”.



6) Please enter name of the file holding these assimilations.



Enter name of file that will hold the list of assimilations and their replacements. The

file should have the following header: “OLD NEW”. This header therefore

defines two columns. Under the “OLD” column will be the list of assimilations

(#1..#2..#3) that must be replaced, while under the “NEW” column will be written

their replacements. Note that the replacement strings should not contain any

spaces.








Note that this question will only be asked if the answer to Question 5 is “y”.



2.30.4 Generated Output Files



See Section 2.2.4.





2.31 Rip.pl



2.31.1 Overview



This script determines which orthographic words in a given batch of TextGrid files do not

occur in the lexicon. It can work with TextGrid files containing only orthographic

transcriptions, or orthographic and phonetic transcriptions.



2.31.2 Command Line Options



1) Rip.pl



Will check the TextGrid files that can be found in the directories which are specified

in the list file DirectoryList.lst for words that are not in the lexicon.



2) Rip.pl -f



Will check the specified TextGrid file for words that are not in the lexicon.



3) Rip.pl -d



Will check the specified directory's TextGrid files for words that are not in the

lexicon.



4) Rip.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.31.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)

errors [13]. Otherwise, answer “n” in order to abort the program.



2) Enter the name of the file containing the NOVAR lexicon.



Specify the name of the NOVAR lexicon [14] against which the script compares the

orthographic words it encounters in order to see which occur in the lexicon and

which do not.





[13] Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.

[14] A NOVAR lexicon contains only one phonetic sequence for each orthographic word.






3) Select which words should be compared to the lexicon

1. All the orthographic words.

2. Only those words standing between dollar markers.

3. Only those words that are not standing between dollar markers.



Here the user must specify which words should be compared to the lexicon entries.



4) Only check words which contain internal zeros (y/n)



Answer “y” if only those words with internal zeros should be compared to the

lexicon. Otherwise, answer “n” to compare ALL the words (including the words with

internal zeros) to the lexicon.



2.31.4 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.31.5 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files during the process of running this script.



If working with one or more directories, the following files could possibly be created in the

startup directory:



• TextGridFilesContainingWordsNotInLexicon.txt



This file will hold for each affected TextGrid, the words which do not occur in the

lexicon. The words are identified during a CIS (case insensitive search) and should

thus be included in the lexicon. An example of such a file is shown below (note: the

lexicon contained few entries):



J:\PhonCorrect\EE\merge:



EE002LM002.TextGrid - Words not in lexicon:

Michael

Titlestad



EE002LM003.TextGrid - Words not in lexicon:

A.

D.

E.

I.

L.

S.

T.










EE002LM004.TextGrid - Words not in lexicon:

male



EE002LM005.TextGrid - Words not in lexicon:

six

thirty



EE002LM006.TextGrid - Words not in lexicon:

September

four

nineteen

of

sixth

sixty

twenty



.

.

.





• TextGridFilesContainingWordsWithCaseProblems.txt



This file will hold for each affected TextGrid, the words which do not occur in the

lexicon during a CSS (case sensitive search), but do occur during a CIS. These

words could possibly contain case problems and as such should be handled

separately from those words which are identified during a CIS. An example of such

a file is shown below (the same lexicon and data were used as in the above

example):





J:\PhonCorrect\EE\merge:



EE002LM014.TextGrid - Words with case problems:

May



EE002LM017.TextGrid - Words with case problems:

No



EE004LF014.TextGrid - Words with case problems:

May



EE006LF006.TextGrid - Words with case problems:

May



.

.

.












This example shows that two words cause problems, namely “May” and “No”. Both

occurred in the lexicon as “may” and “no”. Now, since the word “may” occurred in

the lexicon, but the word “May” did not, the word “May” was flagged as possibly

being a word with a case problem. However, in this instance it was not, since “May”

was used as the name of a month and as such should be included in the lexicon

(remember that the lexicon is case sensitive!). On the other hand, the word “No”

was incorrectly used in the data, since it should have been written as “no”. As a

result, it must be changed in the TextGrid file to “no”.



• WordsNotInLexicon.txt



Will hold a unique list of all the words that were identified during a CIS to not be

included in the lexicon.



• WordsWithCaseProblems.txt



Will hold a unique list of all the words which do not occur in the lexicon during a

CSS, but do occur during a CIS.
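The distinction between the two searches can be made concrete with a small Python sketch. A word that fails the case insensitive search (CIS) is not in the lexicon at all, while a word that fails the case sensitive search (CSS) but passes the CIS is a possible case problem. The function name and the use of plain word lists as input are assumptions made for illustration; the sketch does not read TextGrid or lexicon files.

```python
def classify_words(words, lexicon):
    """Split words into (not_in_lexicon, case_problems) as sorted unique lists.

    not_in_lexicon: no match even when case is ignored (fails the CIS).
    case_problems:  no exact match (fails the CSS) but a match when case is ignored.
    """
    exact = set(lexicon)
    folded = {entry.lower() for entry in lexicon}
    not_in_lexicon, case_problems = set(), set()
    for word in words:
        if word in exact:
            continue  # exact match: nothing to report
        if word.lower() in folded:
            case_problems.add(word)
        else:
            not_in_lexicon.add(word)
    return sorted(not_in_lexicon), sorted(case_problems)
```

With a lexicon holding only “may” and “no”, the words “May” and “No” would be reported as possible case problems and “Michael” as not in the lexicon, which matches the example discussed above.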



If working with a single TextGrid file, the following files could possibly be created in the

startup directory:



• WordsNotInLexicon.txt



• WordsWithCaseProblems.txt





2.32 SubstitutePhonCharacters.pl



2.32.1 Overview



Substitutes certain characters (or strings) in each TextGrid file's phonetic transcription with

new ones. The characters (or strings) and their replacements must be specified in

SubstitutePhonCharacters.lst. Also note that the TextGrid files must contain both

orthographic and phonetic transcriptions.



2.32.2 Command Line Options



1) SubstitutePhonCharacters.pl



Substitutes certain phon characters in the TextGrid files that can be found in the

directories which are specified in the list file DirectoryList.lst.



2) SubstitutePhonCharacters.pl -f



Substitutes certain phon characters in the specified TextGrid file.



3) SubstitutePhonCharacters.pl -d



Substitutes certain phon characters in the specified directory's TextGrid files.










4) SubstitutePhonCharacters.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.32.3 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• SubstitutePhonCharacters.lst



Will hold the list of characters or text strings to be replaced and their replacements.

The file should have the following header: “OLD NEW”. This header therefore

defines two columns. Under the “OLD” column will be the list of text strings that

must be replaced, while under the “NEW” column will be written their replacement

strings. Note that the replacement strings should not contain any spaces.
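The two-column layout described above can be read and applied as sketched below. This is a Python illustration, not the Perl implementation. It assumes that the columns are separated by whitespace and that substitutions are applied in file order; both are assumptions, as are the function names. Because replacement strings may not contain spaces, each line is split at its last run of whitespace, which would also allow an OLD string to contain spaces.

```python
def parse_substitutions(lines):
    """Parse the lines of an 'OLD NEW' table into a list of (old, new) pairs."""
    rows = [line.strip().rsplit(None, 1) for line in lines if line.strip()]
    if not rows or rows[0] != ["OLD", "NEW"]:
        raise ValueError("first non-empty line must be the header 'OLD NEW'")
    pairs = []
    for row in rows[1:]:
        if len(row) != 2:
            raise ValueError("each row needs two columns: %r" % (row,))
        pairs.append((row[0], row[1]))
    return pairs


def substitute(text, pairs):
    """Apply the substitutions to text in file order (assumed ordering)."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text
```

Since later substitutions see the result of earlier ones, the order of the rows in the file matters under this assumption.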



2.32.4 Generated Output Files



See Section 2.2.4.





2.33 Transcribe.pl



2.33.1 Overview



This script generates deterministic phonetic transcriptions for EE, IE, AE, BE, CE, SS,

XX and ZZ. Only orthographic transcriptions are allowed in the TextGrid files. The TextGrid

files will afterwards contain both orthographic and deterministic phonetic transcriptions.



2.33.2 Command Line Options



1) Transcribe.pl



Will transcribe the TextGrid files that can be found in the directories which are

specified in the list file DirectoryList.lst.



2) Transcribe.pl -f



Will only transcribe the specified TextGrid file.



3) Transcribe.pl -d



Will only transcribe the specified directory's TextGrid files.



4) Transcribe.pl -l



The newly specified list file will be used instead of DirectoryList.lst.












2.33.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)

errors [15]. Otherwise, answer “n” in order to abort the program.



2) WARNING: This script will convert the orthographic TextGrids to merged files.

Do you want to continue? (y/n)



Answer “y” to allow the script to convert the orthographic TextGrids to merged

TextGrid files containing both orthographic and phonetic transcriptions. Otherwise,

answer “n” in order to abort the program.



3) Please specify the language

1. EE/IE/AE/BE/CE

2. SS

3. XX

4. ZZ



Simply choose the language to be transcribed.



4) Enter name of file containing NOVAR lexicon.



Specify the name of the NOVAR lexicon [16] that will be used during the transcription

process.



Note that this question will only be asked if the answer to Question 3 is “1” =>

EE/IE/AE/BE/CE.



5) Enter name of file containing grapheme-to-phon conversion rules.



Specify the name of the file holding the grapheme-to-phoneme conversion rules.

This script will be able to work with the rules of SS (Appendix B), XX (Appendix C),

and ZZ (Appendix D). Note that CheckLexicon.pl (Section 2.7) can be used to

check a grapheme-to-phoneme conversion rule lexicon for errors.



Note that this question will only be asked if the answer to Question 3 is “2”, “3” or

“4” => SS/XX/ZZ.



6) Enter name of file containing NOVAR lexicon.



Specify the name of the NOVAR lexicon that will be used during the transcription

process. Note that this lexicon must only contain the transcriptions of spelled letters

and words with internal zeros.



Note that this question will only be asked if the answer to Question 3 is “2”, “3” or

“4” => SS/XX/ZZ.



[15] Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.

[16] A NOVAR lexicon contains only one phonetic sequence for each orthographic word.






7) Enter name of file containing BE NOVAR lexicon



Specify the name of the BE NOVAR lexicon that will be used to transcribe the

English words occurring between dollar markers.



Note that this question will only be asked if the answer to Question 3 is “2”, “3” or

“4” => SS/XX/ZZ.



2.33.4 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.



2.33.5 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files during the process of running this script.



If working with one or more directories and words with internal zeros are encountered, the

following file will be created in the startup directory:



• TextGridFilesContainingWordsWithInternalZeros.txt



This file will hold for each affected TextGrid, the words with internal zeros. The file

has the same layout as the file shown in the example of Section 2.31.5.



Note that the generation of this file is not necessary anymore, since

WordsWithInternalZeros.pl can be used to obtain this information. This functionality

can therefore be removed from the script.





2.34 WordsWithInternalZeros.pl



2.34.1 Overview



Extracts words with internal zeros from orthographic transcriptions. It can work with

TextGrid files containing only orthographic transcriptions, or orthographic and phonetic

transcriptions.



2.34.2 Command Line Options



1) WordsWithInternalZeros.pl



Will search the TextGrid files that can be found in the directories which are specified

in the list file DirectoryList.lst for words with internal zeros.












2) WordsWithInternalZeros.pl -f



Will only search the specified TextGrid file for words with internal zeros.



3) WordsWithInternalZeros.pl -d



Will only search the specified directory's TextGrid files for words with internal zeros.



4) WordsWithInternalZeros.pl -l



The newly specified list file will be used instead of DirectoryList.lst.



2.34.3 Questions



1) For this script to work properly, the data must be free of normal CFE errors.

Do you want to continue? (y/n)



Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)

errors [17]. Otherwise, answer “n” in order to abort the program.



2.34.4 Generated Output Files



See Section 2.5.5 for the files that will be generated if errors are encountered in the

TextGrid files during the process of running this script.



If working with one or more directories, the following files will be created in the startup

directory if words with internal zeros are encountered:



• TextGridFilesContainingWordsWithInternalZeros.txt



This file will hold for each TextGrid, the words with internal zeros that were

encountered. The file has the same layout as the file shown in the example of

Section 2.31.5.



• WordsWithInternalZeros.txt



Holds the unique list of all the words with internal zeros that were encountered.



If working with a single TextGrid file and words with internal zeros are encountered, the

following file will be created in the startup directory:



• WordsWithInternalZeros.txt





[17] Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.



2.35 WorkThroughFWEErrorFiles.pl



2.35.1 Overview



This script allows the user to work through those TextGrid files lying under the

FilesWithErrors (FWE) directory containing normal (determined by using

CheckForErrors.pl and includes utterance and sentence marker location errors,

orthographic and phonetic alignment errors, and phonetic symbol errors), SNBD or

assimilation errors. Note that the TextGrid files must be accompanied by their alaw files in

order for this script to work properly. In addition to this, FilesWithErrors.txt must also occur

in the FWE directory. Also note that the FWE directory must be specified in a list file or on the

command line. Only one directory entry will be allowed in the list file.



2.35.2 Command Line Options



1) WorkThroughFWEErrorFiles.pl



Working directory will be extracted from DirectoryList.lst.



2) WorkThroughFWEErrorFiles.pl -d



Working directory will be extracted from command line.



3) WorkThroughFWEErrorFiles.pl -l



Working directory will be extracted from the newly specified list file instead of

DirectoryList.lst.



2.35.3 Questions



1) Choose one of the following

1. Start from scratch

2. Continue with previous session



Specify whether the session should start from the beginning (therefore start with the

first TextGrid in FWE) by answering “1” or to continue with a previous session by

answering “2”.



Note that this question will only be asked if FWEPreviousSession.txt exists in the

startup directory when the program is started. If not, the program will immediately

jump to Question 3.



2) Starting from scratch will overwrite the previous session's information!

Do you want to continue? (y/n)



Answer “y” if you want to overwrite the previous session’s information that is stored

in the startup directory under FWEPreviousSession.txt.



Note that this question will only be asked if the answer to Question 1 is “1”.



3) The TextGrids contain what type of errors

1. Normal CFE errors

2. SNBD (sil/nonsil/boundary/duration) errors

3. Assimilation errors



Specify the type of error that is being worked with.



Note that this question will only be asked if the answer to Question 1 is “1”.








2.35.4 Programs to Install



The following program must be installed under C:\Program Files\sendpraat:



• sendpraat.exe



This executable allows the script to control Praat.



2.35.5 Files Needed during Startup



The following file must exist in FWE subdirectory in order for the script to work properly:



• FilesWithErrors.txt



This file holds the names of TextGrids containing errors as well as their associated

error messages.



2.35.6 Generated Output Files



The following files in the startup directory will constantly be updated during program

execution:



• FWEPreviousSession.txt



Current session’s information (previous session’s information during next startup)

will constantly be written to this file during program execution.



• FWEPreviousSession_bak.txt



FWEPreviousSession.txt will be saved to this file before it is updated with new

information.
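The backup-then-update behaviour of the two session files can be sketched as follows. The filenames are taken from this section; the function name, the `directory` parameter and the idea of passing the session state as a list of text lines are assumptions, since the contents of the session file are not described here.

```python
import os
import shutil

SESSION_FILE = "FWEPreviousSession.txt"
BACKUP_FILE = "FWEPreviousSession_bak.txt"


def save_session(lines, directory="."):
    """Back up the existing session file, then write the new session state.

    lines is a hypothetical representation of the session state.
    """
    session_path = os.path.join(directory, SESSION_FILE)
    backup_path = os.path.join(directory, BACKUP_FILE)
    if os.path.exists(session_path):
        # Preserve the last known good state before overwriting it.
        shutil.copyfile(session_path, backup_path)
    with open(session_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Copying before writing means that an interrupted update still leaves the previous state available in the backup file.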





2.36 WorkThroughLex.pl



2.36.1 Overview



This script will enable the user to check the data for errors based on the information

contained in the VAR lexicon [18]. The user will thus be allowed to work through the lexicon

and then inspect the TextGrid files of those entries that look suspicious. Note that each

TextGrid containing an error can be edited using either Notepad or Praat. In addition to

this, its alaw file can be listened to using either Awave Studio or Praat.



[18] A VAR lexicon can contain various phonetic sequences for each orthographic word. The phonetic sequences of an orthographic word are separated with commas.



2.36.2 Questions



1) Choose one of the following

1. Start from scratch

2. Continue with previous session



Specify whether the session should start from the beginning (therefore start with the

first entry in the lexicon) by answering “1” or to continue with a previous session by

answering “2”.



Note that this question will only be asked if PreviousSession.txt exists in the startup

directory when the program is started.



2) Starting from scratch will overwrite the previous session's information!

Do you want to continue? (y/n)



Answer “y” if you want to overwrite the previous session’s information that is stored

in the startup directory under PreviousSession.txt.



Note that this question will only be asked if the answer to Question 1 is “1”.



3) Please enter the name of the file holding the VAR lexicon.



Specify the name of the VAR lexicon that was built using BuildLex.pl

(Section 2.3).



Note that this question will only be asked if the answer to Question 1 is “1”.



4) Please enter the name of the file holding the location info.



Specify the name of the file holding the location information that was built

using BuildLex.pl.



Note that this question will only be asked if the answer to Question 1 is “1”.
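To make the structure of a VAR lexicon concrete, the sketch below reads one into a dictionary. It is written in Python for illustration only and is not taken from WorkThroughLex.pl. The comma between phonetic sequences comes from the description of the VAR lexicon; the tab between the orthographic word and its sequences, the function name, and the sample entries in the usage note are assumptions.

```python
def parse_var_lexicon(lines, field_sep="\t"):
    """Map each orthographic word to its list of phonetic sequences.

    Assumes one entry per line: word, a separator (a tab here, which is
    an assumption), then comma-separated phonetic sequences.
    """
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # ignore empty lines
        word, _, variants = line.partition(field_sep)
        sequences = [v.strip() for v in variants.split(",") if v.strip()]
        lexicon.setdefault(word, []).extend(sequences)
    return lexicon
```

For example, a hypothetical line `kat<TAB>k a t,k A t` would yield the word `kat` with two phonetic sequences, each of which could then be reviewed in turn.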



2.36.3 Programs to Install



The following programs must be installed in order for the script to work properly:



• sendpraat.exe



This executable allows the script to control Praat and must be installed under

C:\Program Files\sendpraat.



• Awave.exe



This executable allows the user to listen to the alaw files and must be installed

under C:\Program Files\Awave Studio.



2.36.4 Files Needed in Startup Directory



The following file must lie in the startup directory in order for the script to work properly:



• PhoneSets.lst



Contains the XSAMPA and Praat phone sets. For more information about this list

file see Appendix A.








2.36.5 Generated Output Files



The following files will constantly be updated in the startup directory during program

execution:



• PreviousSession.txt



Current session’s information (previous session’s information during next startup)

will constantly be written to this file during program execution.



• PreviousSession_bak.txt



PreviousSession.txt will be saved to this file before it is updated with new

information.












3. USING THE SCRIPTS



3.1 Processing Raw Orthographic Transcription Files



The raw orthographic transcription files usually lie in call folders under a base directory.

Each call folder contains alaws, TextGrids and an attribute file. These files need to be

renamed and then copied to the following directories:



• alaw

• attrib

• ort



The following directories must also be created:



• lab

• merge

• phon



In addition to this, the orthographic TextGrids must be checked for transcription errors

before moving and renaming the transcription files.



The following scripts must be run in the order indicated below in order to accomplish these

tasks:



1. CheckNumberOfALawsAndTextGrids.pl (x1)



First of all, the call subdirectories must be checked to see whether they contain the

same number of alaw and orthographic TextGrid files.



2. GetTextGridDirectoriesRecursively.pl (x1)



Produce a list file with all the call folders in it. This list file will be used by the scripts

below.



3. CheckAndReplaceOrtNames.pl (x1)



Correct any orthographic transcription name errors.



4. Cleanup.pl (x1)



The transcriptions are forced to comply with certain specifications. This step is

crucial, since it lightens the workload when correcting the errors pointed out by

CheckForErrors.pl.



5. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are

removed)



Correct any transcription errors that exist in the TextGrid files using these two

scripts. Once the errors have been corrected, copy the TextGrid files back from the

FilesWithErrors directory to the TextGrid directory.










6. Cleanup.pl (x1)



Run Cleanup.pl one more time to ensure the transcriptions comply with the

specifications with regard to the use of white space.



7. RemoveInvalidIntervals.pl (x1!)



This script will remove any intervals that were incorrectly added at the beginning

and the end of the transcriptions in Praat. However, if it is certain that the transcribers

did not add such intervals, this step can be skipped.



8. ProcessRawTranscriptionData.pl (x1)



The transcription files will be renamed and moved to alaw, attrib, and ort directories.
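The check performed in step 1 above can be sketched as follows. This is an illustrative Python version, not the logic of CheckNumberOfALawsAndTextGrids.pl itself; the function name and the file extensions `.alaw` and `.TextGrid` (compared case-insensitively) are assumptions.

```python
import os


def mismatched_call_folders(base_dir, audio_ext=".alaw", grid_ext=".textgrid"):
    """Return {folder: (n_audio, n_grids)} for folders whose counts differ.

    Every directory below base_dir is visited; extensions are compared
    in lower case. Both extensions are assumptions of this sketch.
    """
    mismatches = {}
    for folder, _subdirs, files in os.walk(base_dir):
        lowered = [name.lower() for name in files]
        n_audio = sum(name.endswith(audio_ext) for name in lowered)
        n_grids = sum(name.endswith(grid_ext) for name in lowered)
        if n_audio != n_grids:
            mismatches[folder] = (n_audio, n_grids)
    return mismatches
```

An empty result means that every call folder holds as many alaw files as TextGrid files, so the remaining steps can proceed.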





3.2 Generating Deterministic Phonetic Transcriptions



The following steps must be performed in order to generate TextGrid files containing

orthographic and deterministic phonetic transcriptions. Afterwards, the data must be given

to the transcribers in order for them to perform phonetic correction on the data.



1. Rip.pl



Determine which words do not occur in the NOVAR lexicon. These words and their

phonetic representations must then be included in the lexicon. The number of times

to run this script will depend on the number of lexicons being used during the

transcription process.



2. CheckLexicon.pl (repeat until all errors are removed)



Make sure the NOVAR lexicon(s) contain(s) no errors. If a transcription rule lexicon

is also used, check it as well.



3. Transcribe.pl (x1)



Generate deterministic phonetic transcriptions using lexicon(s) and possibly

transcription rules.



4. CheckForErrors.pl (x1)



Run this script in order to make sure that there are no transcription errors.

Remember to check for orthographic and phonetic alignment errors, and phonetic

symbol errors. If errors exist, follow steps 4 to 6 of Section 3.1 to correct them.



After the deterministic phonetic transcriptions have been generated, move the TextGrid

files from the ort directory to the merge directory. Once this has been done, remove the ort

and phon folders. These directories can be removed, since they were only used when

Patana was still employed to produce the deterministic phonetic transcriptions.
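The idea behind step 1 above, finding the words that do not yet occur in the lexicon, can be sketched as follows. This Python fragment is illustrative only and does not reproduce Rip.pl; the function name, whitespace tokenisation, and case-insensitive comparison are assumptions, and real transcriptions may contain markers that need separate handling.

```python
def missing_words(transcriptions, lexicon_words):
    """Return the sorted list of words that the lexicon does not cover.

    transcriptions: iterable of orthographic transcription strings.
    lexicon_words:  iterable of words already present in the lexicon.
    """
    known = {word.lower() for word in lexicon_words}
    seen = set()
    for text in transcriptions:
        for token in text.split():         # naive whitespace tokenisation
            seen.add(token.lower())
    return sorted(seen - known)
```

Each word returned would need a phonetic representation added to the lexicon before the transcription step is run.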












3.3 Processing Phonetically Corrected Data



After the transcribers have phonetically corrected the data, it must be processed and

checked for errors. The following must be done:



1. Cleanup.pl (x1)



The transcriptions are forced to comply with certain transcription specifications. This

step is crucial, since it lightens the workload when correcting the errors pointed out

by CheckForErrors.pl.



2. BottomToTop.pl (x1)



Those orthographic intervals falling outside of utterance and sentence markers will

be updated to correspond with phonetic intervals. However, if the transcribers

updated both the orthographic and phonetic intervals, this step can be skipped.



3. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are

removed)



Correct any transcription errors that exist in the TextGrid files using these two

scripts. Remember to check for orthographic and phonetic alignment errors, and

phonetic symbol errors. Once the errors have been corrected, copy the TextGrid

files back from the FilesWithErrors directory to the TextGrid directory.



4. Cleanup.pl (x1)



Run Cleanup.pl one final time to ensure the transcriptions comply with the

specifications with regard to the use of white space.



5. Converter.pl (x1)



If the phonetic transcription format is in Praat, convert it to XSAMPA.



6. GenerateLabFiles.pl (x1)



Generate lab files using information contained in TextGrid files.





3.4 Merging Phonetically Corrected Batches of Data



The batches that have been phonetically corrected must be merged at some point. The

following scripts are employed to accomplish this:



1. CopyPhoneCorrectedFiles.pl (x1)



Will copy phonetically corrected data that's scattered over several base directories

to a single target base directory containing alaw/attrib/lab/merge directories.












2. RemoveDuplicateAttribs.pl (x1)



Removes from the attribute directory any duplicate attribute files that were

identified in possible earlier merging sessions, by looking at the call numbers

contained within each attribute file.



3. MoveDuplicateAttribs.pl (x1)



Move all remaining duplicate attribute files from the attribute directory to a specified

directory. Work through these files and copy those that must be kept back to the

attribute directory.



4. DeleteRedundantALTFilesLookingAtAttribs.pl (x1)



Remove all alaw/lab/TextGrid files which do not have corresponding attribute files.



5. Renamer.pl (x1)



Rename alaw, attrib, lab and TextGrid files so that they conform to the specification

with respect to case and counter.
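The selection made in step 4 above (data files without a corresponding attribute file) can be sketched as follows. This Python fragment is an illustration, not the logic of DeleteRedundantALTFilesLookingAtAttribs.pl. How a data file is matched to its attribute file is not described in this section, so the `call_id` naming rule below is purely hypothetical, and the sketch only lists candidates instead of deleting anything.

```python
import os


def call_id(filename):
    """Hypothetical naming rule: the call identifier is the part of the
    file name before the first underscore (or the whole stem)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem.split("_", 1)[0].lower()


def files_without_attrib(data_files, attrib_files, key=call_id):
    """List the data files whose call identifier matches no attribute file."""
    known = {key(name) for name in attrib_files}
    return [name for name in data_files if key(name) not in known]
```

Passing a different `key` function adapts the sketch to whatever naming convention the database actually uses.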



3.5 Correcting More Errors



Once all the batches of a specific language have been merged, SNBD, assimilation and

lexicon errors must be corrected.



3.5.1 Sil/Nonsil/Boundary/Duration Errors



First, the data must be sent to the engineering team. They will run programs on the data in

order to determine sil/nonsil/boundary (SNB) errors. The following scripts must then be run

on the data to correct these errors:



1. CopySNBDErrorFilesToFWE.pl (x1)



Copy those TextGrid files – and their alaws – with SNB errors to the FilesWithErrors

subdirectory under the TextGrid directory.



2. WorkThroughFWEErrorFiles.pl



Work through those TextGrid files lying under the FilesWithErrors directory

containing SNB errors. Once the errors have been corrected, copy the TextGrid

files back from the FilesWithErrors directory to the TextGrid directory.



3. Cleanup.pl (x1)



The transcriptions are forced to comply with certain specifications. This step is

crucial, since it lightens the workload when correcting the errors pointed out by

CheckForErrors.pl.












4. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are

removed)



Correct any transcription errors that exist in the TextGrid files using these two

scripts. Once the errors have been corrected, copy the TextGrid files back from the

FilesWithErrors directory to the TextGrid directory.



5. Cleanup.pl (x1)



Run Cleanup.pl one final time to ensure the transcriptions comply with the

specifications with regard to the use of white space.



6. GenerateLabFiles.pl (x1)



Regenerate lab files.



At this point the data is sent back to the engineers in order for them to determine where

duration errors occur. To work through these errors, follow steps 1 to 6 above.
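The manual copy-back that ends steps 2 and 4 above can be sketched as follows. This is an illustrative Python helper, not part of the report's scripts. It relies on the FilesWithErrors subdirectory lying under the TextGrid directory, as described in step 1; the function name and the `.TextGrid` extension are assumptions.

```python
import os
import shutil


def copy_back(textgrid_dir, fwe_name="FilesWithErrors", ext=".textgrid"):
    """Copy corrected TextGrids from the FWE subdirectory to its parent.

    Returns the names of the files copied, in sorted order. Existing
    files of the same name in textgrid_dir are overwritten.
    """
    fwe_dir = os.path.join(textgrid_dir, fwe_name)
    copied = []
    for name in sorted(os.listdir(fwe_dir)):
        if name.lower().endswith(ext):
            shutil.copyfile(os.path.join(fwe_dir, name),
                            os.path.join(textgrid_dir, name))
            copied.append(name)
    return copied
```

Only TextGrid files are copied; the alaw files that were placed in the FWE subdirectory for listening are left where they are.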



3.5.2 Assimilation Errors



After the SNBD errors have been corrected, the data is again sent to the engineering

team. This time round, a list of possible assimilation errors is produced. Run the following

scripts on the data in order to correct these errors:



1. CopyAssimErrorFilesToFWE.pl (x1)



TextGrids (filenames extracted from the list file produced by the engineers) – and their

alaws – with possible assimilation errors are copied to the FilesWithErrors (FWE)

subdirectory under the TextGrid directory. The list file produced by the engineering

team must also be copied to the FWE subdirectory and renamed to

FilesWithErrors.txt.



2. WorkThroughFWEErrorFiles.pl (x1)



Work through those TextGrid files lying under the FWE directory containing

assimilation errors. Once the errors have been corrected, copy the TextGrid files

back from the FilesWithErrors directory to the TextGrid directory.



3. Cleanup.pl (x1)



The transcriptions are forced to comply with certain specifications. This step is

crucial, since it lightens the workload when correcting the errors pointed out by

CheckForErrors.pl.



4. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are

removed)



Correct any transcription errors that exist in the TextGrid files using these two

scripts. Once the errors have been corrected, copy the TextGrid files back from the

FilesWithErrors directory to the TextGrid directory.








5. Cleanup.pl (x1)



Run Cleanup.pl one final time to ensure the transcriptions comply with the

specifications with regard to the use of white space.



6. GenerateLabFiles.pl (x1)



Regenerate lab files.



3.5.3 Lexicon Errors



A VAR lexicon must now be extracted from the data and worked through in order to

eliminate more errors from the data:



1. BuildLex.pl (x1)



Extract VAR lexicon from TextGrid files.



2. CheckLexicon.pl (repeat until all errors are removed)



Make sure the VAR lexicon contains no errors.



3. WorkThroughLex.pl



Work through VAR lexicon and correct any errors that are found in the TextGrid

files.



4. Cleanup.pl (x1)



The transcriptions are forced to comply with certain specifications. This step is

crucial, since it lightens the workload when correcting the errors pointed out by

CheckForErrors.pl.



5. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are

removed)



Correct any transcription errors that exist in the TextGrid files using these two

scripts. Once the errors have been corrected, copy the TextGrid files back from the

FilesWithErrors directory to the TextGrid directory.



6. Cleanup.pl (x1)



Run Cleanup.pl one final time to ensure the transcriptions comply with the

specifications with regard to the use of white space.



7. GenerateLabFiles.pl (x1)



Regenerate lab files.












3.6 Final Processing of Information



The data, VAR lexicon and counts lexicon can be delivered to the engineering team after

the following scripts have been run on the data:



1. ExtractTranscriptions.pl (x1)



Produce summary of orthographic and phonetic transcriptions occurring in the

TextGrid files.



2. BuildLex.pl (x1)



Extract VAR lexicon from TextGrid files. In addition to this, a counts lexicon is also

produced.



3. CheckLexicon.pl (x1)



Make sure the VAR lexicon contains no errors.
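The relationship between the VAR lexicon and the counts lexicon produced in step 2 above can be sketched as follows. This Python fragment is illustrative and does not reproduce BuildLex.pl. The exact content of the counts lexicon is not defined in this section, so the sketch assumes that it records how often each word and phonetic sequence pair occurs; the function name and the input format (pairs already extracted from the TextGrid files) are also assumptions.

```python
from collections import Counter, defaultdict


def build_lexicons(pairs):
    """Build a VAR lexicon and a counts lexicon from (word, phones) pairs.

    Returns (var_lexicon, counts), where var_lexicon maps each word to
    its distinct phonetic sequences in first-seen order, and counts maps
    each (word, phones) pair to its number of occurrences.
    """
    counts = Counter(pairs)
    var_lexicon = defaultdict(list)
    for word, phones in counts:            # keys keep first-seen order
        var_lexicon[word].append(phones)
    return dict(var_lexicon), dict(counts)
```

A pronunciation that occurs only once or twice for a frequent word may be worth a second look when working through the lexicon.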












APPENDIX A – PHONESETS.LST





XSAMPA Praat

p p

p_h p^h

p_> p'

p-\S_> p\-v\sh'

p-\S_h p\-v\sh^h

b b

b_0 b\0v

b-\Z b\-v\zh

t t

t_h t^h

t_> t'

t-\K_> t\-v\l-'

t-\S t\-v\sh

t-\S_h t\-v\sh^h

t-\s_> t\-vs'

t-\s_h t\-vs^h

d d

d_0 d\0v

d-\Z d\-v\zh

d-\K\ d\-v\lz

d-\z d\-vz

c_> c'

c_h c^h

J\ \j-

k k

k_h k^h

k_> k'

k-\x_> k\-vx'

g \gs

? \?g

m m

m= m\|v

n n

J \nj

N \ng

r r

R\ \rc

4 \fh

f f

v v

T \te

D \dh

s s

z z

S \sh

Z \zh

x x












h h

h\ \h^

K \l-

K\ \lz

r\ \rt

j j

l l

|\ \|1

|\~ \|1\~^

|\_v~ \|1\~^\vv

|\_v \|1\vv

|\_h \|1^h

|\|\ \|2

|\|\~ \|2\~^

|\|\_v~ \|2\~^\vv

|\|\_v \|2\vv

|\|\_h \|2^h

!\ !

!\~ !\~^

!\_v~ !\~^\vv

!\_v !\vv

!\_h !^h

b_]
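A two-column list such as the XSAMPA/Praat table above lends itself to a simple lookup. The sketch below is in Python for illustration and is not the report's Converter.pl; the function names are hypothetical, and it assumes that the two columns are separated by white space and that the phones of a transcription are separated by spaces.

```python
def load_phone_sets(lines):
    """Parse 'XSAMPA Praat' rows into a dict, skipping the header row
    and any row that does not have exactly two columns."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or parts == ["XSAMPA", "Praat"]:
            continue
        xsampa, praat = parts
        table[xsampa] = praat
    return table


def convert(sequence, mapping):
    """Convert a space-separated phone sequence using the given mapping.

    Raises KeyError if a phone is not in the mapping."""
    return " ".join(mapping[phone] for phone in sequence.split())
```

Inverting the dictionary, for example with `{praat: xsampa for xsampa, praat in table.items()}`, gives the Praat to XSAMPA direction.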

k=[k_>]



ll=[l=]

l=[l]



mm=[m=]

m=[m]



ntjh=[J t-\S_h]

ntj=[J t-\S]

ny=[J]

nq=[!\~]

nk=[N k_>]

ng=[N]

n=[n]














oCCCu=[o]

oCCCi=[o]

oCCu=[o]

oCCi=[o]

oCu=[o]

oCi=[o]

oo=[O:]

o=[O]



pjh=[p-\S_h]

pj=[p-\S_>]

ph=[p_h]

p=[p_>]



qh=[!\_h]

q=[!\]



rr=[r=]

r=[r]



sh=[S]

s=[s]



tsh=[t-\s_h]

tlh=[t-\K_h]

tjh=[t-\S_h]

ts=[t-\s_>]

tl=[t-\K_>]

tj=[t-\S]

th=[t_h]

t=[t_>]



uu=[u:]

u=[u]



w=[w]



y=[j]












APPENDIX C – XHOSA TRANSCRIPTION RULES





Cam_=[a m=]

aa=[a:]

a=[a]



bh=[b_0]

b=[b_]

k=[k_>]



l=[l]



_mna_=[m= n a]

mhl=[m K]

mb=[m b]

mh=[m]

m=[m]



ntyh=[J c_h]

ntsh=[J t-\s_>]

ndl=[n d-\K\]

ndy=[J J\]

ngc=[|\_v~]

ngq=[!\_v~]

ngx=[|\|\_v~]

nkc=[N |\]

nkh=[N k_h]

nkq=[N !\]

nkw=[N k_> w]

nkx=[N |\|\]

nty=[J c_>]

nyh=[J]

nkV=[N k_>]

nc=[|\~]

ng=[N g]

nj=[J d-\Z]

nq=[!\~]

nx=[|\|\~]

ny=[J]

nz=[n d-\z]

n=[n]



oCCCCi=[o]

oCCCCu=[o]

oCCCi=[o]

oCCCu=[o]

oCCi=[o]

oCCu=[o]

oCi=[o]

oCu=[o]

oo=[O:]

o=[O]



ph=[p_h]












p=[p_>]



qh=[!\_h]

q=[!\]



rh=[x]

r=[r]



sh=[S]

s=[s]



ths=[t-\s_>]

tsh=[t-\S_h]

tyh=[c_h]

th=[t_h]

tl=[t-\K_>]

ts=[t-\s_>]

ty=[c_>]

t=[t_>]



_umbh=[u m= b]

_umb=[u m= b]

_umC=[u m=]

uu=[u:]

u=[u]



v=[v]



w=[w]



xh=[|\|\_h]

x=[|\|\]



y=[j]



z=[z]












APPENDIX D – ZULU TRANSCRIPTION RULES





aa=[a:]

a=[a]



bh=[b_0]

b=[b_]

k=[k_>]



l=[l]



mhl=[m K]

mb=[m b]

m=[m]



ntsh=[J t S]

ndl=[n d-\K\]

ngc=[|\_v~]

ngq=[!\_v~]

ngx=[|\|\_v~]

nkc=[N |\]

nkq=[N !\]

nkw=[N k_> w]

nkx=[N |\|\]

nkV=[N k_>]

nc=[|\~]

ng=[N g]

nj=[n d-\Z]

nq=[!\~]

nx=[|\|\~]

ny=[J]

nz=[n d-\z]

n=[n]



oCCCCi=[o]

oCCCCu=[o]

oCCCi=[o]

oCCCu=[o]

oCCi=[o]

oCCu=[o]

oCi=[o]

oCu=[o]

oo=[O:]

o=[O]



ph=[p_h]

p=[p_>]



qh=[!\_h]

q=[!\]



r=[r]



sh=[S]

s=[s]














tsh=[t-\S_h]

th=[t_h]

ts=[t-\s_>]

t=[t_>]



uu=[u:]

u=[u]



v=[v]



w=[w]



xh=[|\|\_h]

x=[|\|\]



y=[j]



z=[z]
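Rules of the form `grapheme=[phones]`, as listed in the appendices above, could be applied along the following lines. This is a minimal Python sketch, not the algorithm of Transcribe.pl. It assumes that rules are tried in the order listed (longer graphemes appear before shorter ones in the lists), and it skips rules whose left-hand side contains an upper-case letter or an underscore (for example `oCi` or `nkV`), because the meaning of those context symbols is not defined here. The function names are hypothetical.

```python
import re

# grapheme=[phones], e.g. th=[t_h]
RULE = re.compile(r"^(\S+?)=\[(.*)\]$")


def parse_rules(lines):
    """Return (grapheme, phones) pairs in file order, literal rules only."""
    rules = []
    for line in lines:
        match = RULE.match(line.strip())
        if not match:
            continue
        lhs, rhs = match.groups()
        if "_" in lhs or any(ch.isupper() for ch in lhs):
            continue   # context-dependent rule: semantics not assumed here
        rules.append((lhs, rhs))
    return rules


def transcribe(word, rules):
    """Apply the first matching rule at each position of the word."""
    phones, pos = [], 0
    while pos < len(word):
        for lhs, rhs in rules:
            if word.startswith(lhs, pos):
                phones.append(rhs)
                pos += len(lhs)
                break
        else:
            raise ValueError(f"no rule for {word[pos:]!r}")
    return " ".join(phones)
```

With the rules `th=[t_h]`, `t=[t_>]` and `a=[a]`, the string `thata` is transcribed as `t_h a t_> a`: the two-letter rule wins at the first position because it is listed before the single-letter rule.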








