AFRICAN SPEECH TECHNOLOGY
TECHNICAL REPORT 26
Perl Scripts for AST Database Validation,
Quality Control, and Manipulation.
November 2003
DACST Innovation Fund Project 21213
Consortium Members: University of Stellenbosch, University of Pretoria, West Technology Holdings Pty Ltd,
South African Foundation for Language and Speech Technology Development
Research Unit for Experimental Phonology, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa.
E-mail: (Administrative) jcr@maties.sun.ac.za (Technical) botha@sun.ac.za / dupreez@dsp.sun.ac.za
Tel +27 (0)21 8082106 Fax +27 (0)21 8083975 http://www.ast.sun.ac.za
AST Confidential Page 2 of 78
Identification number: DACST 2193 (AST) T26
Type: Technical Report
Title: Perl Scripts for AST Database Validation, Quality Control, and Manipulation
Status: Final
Date: November 2003
Version: 1.0
Number of pages: 78
Author(s): M.W. Theunissen, mtheunis@sun.ac.za
Project co-ordinator: Justus Roux
E-mail: jcr@maties.sun.ac.za
http://www.ast.sun.ac.za
Access: Confidential
Key words:
Abstract: This document describes the Perl scripts that were developed for AST database
validation, quality control, and manipulation. Each script is described in turn, followed
by an explanation of how to use the scripts in conjunction with each other.
Actual Distribution:
Supplementary notes:
DOCUMENT EVOLUTION
Version   Date            Status   Notes
1.0       November 2003   Final
Language Team Perl Scripts for AST Database Validation, Quality Control, and Manipulation
Contents
1. Introduction
2. Description of AST Scripts
2.1 Assimilations.pl
2.1.1 Overview
2.1.2 Command Line Options
2.1.3 Questions
2.1.4 Generated Output Files
2.2 BottomToTop.pl
2.2.1 Overview
2.2.2 Command Line Options
2.2.3 Questions
2.2.4 Generated Output Files
2.3 BuildLex.pl
2.3.1 Overview
2.3.2 Command Line Options
2.3.3 Questions
2.3.4 Files Needed in Startup Directory
2.3.5 Generated Output Files
2.4 CheckAndReplaceOrtNames.pl
2.4.1 Overview
2.4.2 Command Line Options
2.4.3 Generated Output Files
2.5 CheckForErrors.pl
2.5.1 Overview
2.5.2 Command Line Options
2.5.3 Questions
2.5.4 Files Needed in Startup Directory
2.5.5 Generated Output Files
2.6 CheckForInvalidIntervals.pl
2.6.1 Overview
2.6.2 Command Line Options
2.6.3 Generated Output Files
2.7 CheckLexicon.pl
2.7.1 Overview
2.7.2 Questions
2.7.3 Files Needed in Startup Directory
2.7.4 Generated Output Files
2.8 CheckNumberOfALawsAndTextGrids.pl
2.8.1 Overview
2.8.2 Command Line Options
2.8.3 Generated Output Files
2.9 Cleanup.pl
2.9.1 Overview
2.9.2 Command Line Options
2.9.3 Questions
2.9.4 Generated Output Files
2.10 CompareAlawAndTextGridFilenames.pl
2.10.1 Overview
2.10.2 Command Line Options
2.10.3 Generated Output Files
2.11 Converter.pl
2.11.1 Overview
2.11.2 Command Line Options
2.11.3 Questions
2.11.4 Files Needed in Startup Directory
2.11.5 Generated Output Files
2.12 CopyAssimErrorFilesToFWE.pl
2.12.1 Overview
2.12.2 Command Line Options
2.12.3 Questions
2.13 CopyPhoneCorrectedFiles.pl
2.13.1 Overview
2.13.2 Command Line Options
2.13.3 Questions
2.14 CopySNBDErrorFilesToFWE.pl
2.14.1 Overview
2.14.2 Command Line Options
2.14.3 Questions
2.15 DeleteRedundantALTFilesLookingAtAttribs.pl
2.15.1 Overview
2.15.2 Command Line Options
2.16 ExtractTranscriptions.pl
2.16.1 Overview
2.16.2 Command Line Options
2.16.3 Questions
2.16.4 Generated Output Files
2.17 GenerateLabFiles.pl
2.17.1 Overview
2.17.2 Command Line Options
2.17.3 Files Needed in Startup Directory
2.17.4 Generated Output Files
2.18 GenOrt.pl
2.18.1 Overview
2.18.2 Command Line Options
2.18.3 Questions
2.18.4 Generated Output Files
2.19 GetPronouncedAcronyms.pl
2.19.1 Overview
2.19.2 Command Line Options
2.19.3 Questions
2.19.4 Generated Output Files
2.20 GetTextGridDirectoriesRecursively.pl
2.20.1 Overview
2.20.2 Command Line Options
2.20.3 Generated Output Files
2.21 Lin2Dos.pl
2.21.1 Overview
2.21.2 Command Line Options
2.22 MoveDuplicateAttribs.pl
2.22.1 Overview
2.22.2 Command Line Options
2.22.3 Questions
2.23 MovePhoneticTextGrids.pl
2.23.1 Overview
2.23.2 Command Line Options
2.24 NumberOfMalesAndFemales.pl
2.24.1 Overview
2.24.2 Command Line Options
2.25 ProcessRawTranscriptionData.pl
2.25.1 Overview
2.25.2 Command Line Options
2.25.3 Questions
2.25.4 Generated Output Files
2.26 RemoveDuplicateAttrib.pl
2.26.1 Overview
2.26.2 Command Line Options
2.26.3 Questions
2.27 RemoveInvalidIntervals.pl
2.27.1 Overview
2.27.2 Command Line Options
2.27.3 Generated Output Files
2.28 RemoveTextGridFilesWithErrors.pl
2.28.1 Overview
2.28.2 Command Line Options
2.28.3 Questions
2.29 Renamer.pl
2.29.1 Overview
2.29.2 Command Line Options
2.29.3 Questions
2.30 ReplaceAssimilations.pl
2.30.1 Overview
2.30.2 Command Line Options
2.30.3 Questions
2.30.4 Generated Output Files
2.31 Rip.pl
2.31.1 Overview
2.31.2 Command Line Options
2.31.3 Questions
2.31.4 Files Needed in Startup Directory
2.31.5 Generated Output Files
2.32 SubstitutePhonCharacters.pl
2.32.1 Overview
2.32.2 Command Line Options
2.32.3 Files Needed in Startup Directory
2.32.4 Generated Output Files
2.33 Transcribe.pl
2.33.1 Overview
2.33.2 Command Line Options
2.33.3 Questions
2.33.4 Files Needed in Startup Directory
2.33.5 Generated Output Files
2.34 WordsWithInternalZeros.pl
2.34.1 Overview
2.34.2 Command Line Options
2.34.3 Questions
2.34.4 Generated Output Files
2.35 WorkThroughFWEErrorFiles.pl
2.35.1 Overview
2.35.2 Command Line Options
2.35.3 Questions
2.35.4 Programs to Install
2.35.5 Files Needed during Startup
2.35.6 Generated Output Files
2.36 WorkThroughLex.pl
2.36.1 Overview
2.36.2 Questions
2.36.3 Programs to Install
2.36.4 Files Needed in Startup Directory
2.36.5 Generated Output Files
3. Using the Scripts
3.1 Processing Raw Orthographic Transcription Files
3.2 Generating Deterministic Phonetic Transcriptions
3.3 Processing Phonetically Corrected Data
3.4 Merging Phonetically Corrected Batches of Data
3.5 Correcting More Errors
3.5.1 Sil/Nonsil/Boundary/Duration Errors
3.5.2 Assimilation Errors
3.5.3 Lexicon Errors
3.6 Final Processing of Information
Appendix A – PhoneSets.lst
Appendix B – Sesotho Transcription Rules
Appendix C – Xhosa Transcription Rules
Appendix D – Zulu Transcription Rules
Acronyms
ALT Alaw/Lab/TextGrid
AST African Speech Technology
CFE CheckForErrors.pl
CIS Case Insensitive Search
CSS Case Sensitive Search
FWE FilesWithErrors
SNB Sil/Nonsil/Boundary
SNBD Sil/Nonsil/Boundary/Duration
1. INTRODUCTION
This report gives a short description of each of the Perl scripts that were used for
AST database validation, quality control, and manipulation. It then lists the order in
which the scripts should be run.
2. DESCRIPTION OF AST SCRIPTS
2.1 Assimilations.pl
2.1.1 Overview
Extracts assimilations (#1..#2..#3) from orthographic transcriptions. TextGrid files may
contain either orthographic transcriptions only, or both orthographic and phonetic
transcriptions.
2.1.2 Command Line Options
1) Assimilations.pl
Will extract assimilations from the TextGrid files that can be found in the directories
which are specified in the list file DirectoryList.lst.
2) Assimilations.pl -f
Will only extract assimilations from the specified TextGrid file.
3) Assimilations.pl -d
Will only extract assimilations from the specified directory's TextGrid files.
4) Assimilations.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
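The four invocation modes above follow a pattern shared by most of the scripts in this report. A minimal sketch of how such a dispatch could be written is shown below. It is not the original implementation: the sub names select_mode and read_directory_list are hypothetical, and it assumes that -f, -d and -l each take one argument (a TextGrid file, a directory, and a list file respectively) and that the list file holds one directory name per line.

```perl
use strict;
use warnings;
use Getopt::Std;

# Decide what to process from the command line arguments.
# Assumption: -f, -d and -l each take exactly one argument.
sub select_mode {
    my @args = @_;
    local @ARGV = @args;          # getopts() works on @ARGV
    my %opt;
    getopts('f:d:l:', \%opt) or return;
    return { mode => 'file',      target => $opt{f} } if defined $opt{f};
    return { mode => 'directory', target => $opt{d} } if defined $opt{d};
    # No -f or -d: work through a list file, DirectoryList.lst by default.
    my $list = defined $opt{l} ? $opt{l} : 'DirectoryList.lst';
    return { mode => 'list', target => $list };
}

# Read a list file, assumed here to hold one directory name per line.
sub read_directory_list {
    my ($list_file) = @_;
    open my $fh, '<', $list_file or die "Cannot open $list_file: $!";
    my @dirs;
    while (my $line = <$fh>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @dirs, $line if length $line;   # skip blank lines
    }
    close $fh;
    return @dirs;
}

my $choice = select_mode(@ARGV);
print "mode: $choice->{mode}, target: $choice->{target}\n" if $choice;
```

Giving -f precedence over -d, and -d over -l, is a choice made for this sketch only; the report does not say how conflicting options are resolved.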
2.1.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors (see footnote 1). Otherwise, answer “n” in order to abort the program.
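Confirmation questions of this kind recur throughout the scripts. The sketch below shows one way such a prompt could be implemented. The sub name confirm is hypothetical, and re-asking on invalid input, accepting upper-case answers, and treating end of input as "n" are assumptions rather than documented behaviour.

```perl
use strict;
use warnings;

# Ask a yes/no question; return 1 for "y" and 0 for "n".
# The question is repeated until a valid answer is read.
# End of input is treated as "n" so that the caller aborts safely.
sub confirm {
    my ($question, $in, $out) = @_;
    $in  ||= \*STDIN;
    $out ||= \*STDOUT;
    while (1) {
        print {$out} "$question (y/n) ";
        my $answer = <$in>;
        return 0 unless defined $answer;
        return 1 if $answer =~ /^\s*y\s*$/i;
        return 0 if $answer =~ /^\s*n\s*$/i;
    }
}

# Typical use (commented out so the sketch does not wait for input):
# confirm('Do you want to continue?') or die "Aborted by user.\n";
```

Passing the input and output handles as optional arguments keeps the sub easy to exercise without a terminal.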
Footnote 1: Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.
2.1.4 Generated Output Files
If one or more errors are found in the TextGrids, the following file will be created in the
original startup directory before program execution is stopped:
• DirectoriesWithErrors.txt
Holds the names of all the directories in which TextGrid files with errors were found.
Note that this file will only be created if working in one or more directories.
In addition to this, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory that will hold the TextGrid files
containing the errors. Their alaw files (lying in the working directory or alaw directory) will
be copied to the FWE subdirectory as well. A FilesWithErrors.txt file will also be created in the
FWE folder and will contain an error message for each affected TextGrid file. To work
through these error files, use WorkThroughFWEErrorFiles.pl (Section 2.35).
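The FWE bookkeeping described above can be sketched as follows. This is an illustration only: the sub name collect_files_with_errors is hypothetical, the TextGrid files are copied rather than moved, the alaw file is assumed to share the TextGrid's base name with an ".alaw" extension, the alaw directory is assumed to be a subdirectory called "alaw", and the layout of each line in FilesWithErrors.txt is invented for the example.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy every TextGrid that has an error (and its alaw file, if one is found)
# into a FilesWithErrors subdirectory, and log one message per file.
# $errors is a hash reference: TextGrid file name => error message.
sub collect_files_with_errors {
    my ($dir, $errors) = @_;
    my $fwe = File::Spec->catdir($dir, 'FilesWithErrors');
    unless (-d $fwe) {
        mkdir $fwe or die "Cannot create $fwe: $!";
    }
    my $log = File::Spec->catfile($fwe, 'FilesWithErrors.txt');
    open my $out, '>', $log or die "Cannot write $log: $!";
    for my $textgrid (sort keys %$errors) {
        copy(File::Spec->catfile($dir, $textgrid), $fwe)
            or die "Cannot copy $textgrid: $!";
        # Assumed naming: same base name, ".alaw" extension.
        (my $alaw = $textgrid) =~ s/\.TextGrid$/.alaw/;
        for my $candidate (File::Spec->catfile($dir, $alaw),
                           File::Spec->catfile($dir, 'alaw', $alaw)) {
            next unless -e $candidate;
            copy($candidate, $fwe) or die "Cannot copy $candidate: $!";
            last;
        }
        print {$out} "$textgrid: $errors->{$textgrid}\n";
    }
    close $out or die "Cannot close $log: $!";
    return $fwe;
}
```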
If assimilations are found, the following files will be created in the original startup directory:
• Assimilations.txt
Holds the list of all the unique assimilations that were found in the TextGrid files.
• TextGridFilesContainingAssimilations.txt
Holds the list of all TextGrid files containing assimilations. In addition to this, the
assimilations associated with each TextGrid will be displayed. Note that this file will
only be created if looking for assimilations in one or more directories.
2.2 BottomToTop.pl
2.2.1 Overview
Orthographic intervals are modified in order to correspond with phonetic intervals. Only
those intervals falling outside utterance and sentence markers will be updated. Note that
the TextGrid files must contain orthographic and phonetic transcriptions.
2.2.2 Command Line Options
1) BottomToTop.pl
Will update the TextGrid files that can be found in the directories which are
specified in the list file DirectoryList.lst.
2) BottomToTop.pl -f
Will only update the specified TextGrid file.
3) BottomToTop.pl -d
Will only update the specified directory's TextGrid files.
4) BottomToTop.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.2.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors (see footnote 2). Otherwise, answer “n” in order to abort the program.
2.2.4 Generated Output Files
If one or more errors are found in the TextGrids, the following file will be created in the
original startup directory before program execution is stopped:
• DirectoriesWithErrors.txt
Holds the names of all the directories in which TextGrid files with errors were found.
Note that this file will only be created if working in one or more directories.
In addition to this, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory that will hold the TextGrid files
containing the errors. Their alaw files (lying in the working directory or alaw directory) will be
copied to the FWE subdirectory as well. A FilesWithErrors.txt file will also be created in the
FWE folder and will contain an error message for each affected TextGrid file. To work
through these error files, use WorkThroughFWEErrorFiles.pl (Section 2.35).
If one or more directories were modified the following file will be created in the original
startup directory:
• DirectoriesWithUpdates.txt
Holds the names of all the directories in which TextGrid files were updated. Note
that this file will only be created if working in one or more directories.
2.3 BuildLex.pl
2.3.1 Overview
Builds a VAR lexicon (see footnote 3) from the orthographic and phonetic information contained
within a batch of TextGrid files. TextGrids must therefore contain both orthographic and phonetic
transcriptions.
2.3.2 Command Line Options
1) BuildLex.pl
Will build the lexicon from the TextGrid files that can be found in the directories which are
specified in the list file DirectoryList.lst.
2) BuildLex.pl -d
Will only build the lexicon from the specified directory's TextGrid files.
3) BuildLex.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
Footnote 2: Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.
Footnote 3: A VAR lexicon can contain various phonetic sequences for each orthographic word. The phonetic sequences of an orthographic word are separated with commas.
2.3.3 Questions
1) For this script to work properly, the data must be free of ALL CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of all CFE (CheckForErrors.pl) errors (see footnote 4).
Otherwise, answer “n” in order to abort the program.
2) For this script to work properly, the data must be in XSAMPA format.
Do you want to continue? (y/n)
Answer “y” to this question if the data is in XSAMPA format. Otherwise, answer “n”
in order to abort the program. Note that if the phonetic transcriptions are in Praat
format, Converter.pl (Section 2.11) can be used to convert them to XSAMPA.
3) Please enter the name of the VAR lexicon.
This file will hold the extracted VAR lexicon. An example of a section of such a
lexicon is shown below:
.
.
.
but=[b V t,b a t,b V,b @,b @ t]
but;its=[b V 4 I t s]
but;the=[b V t @,b V d @,b V D @,b @ t @,b a d @,b @ d @,b V D I,b V t I,b a t @,b V t @ ?,b V d E:,b V t 3:,b V t ?,b V 4 @,b V d I,b V t]
but;the;*elen+=[b a d 9 l I n]
but;the;elementary=[b V d @ { l @ m E n t R\ I,b V D { l @ m E n t r\ i:,b V D @ l @ m E n t r\ I,b V D E l @ m E n t r\ i,b V D E l @ m E n t @ r\ i:,b V d @ l @ m E n t r\ I,b D E: l @ m E n t r\ y,b a D E l @ m E n t r i:,b a D E l @ m E n t r\ i,b V d E: l @ m @ n t r\ i:]
butterfly=[b V t f V-\I,b V t @ f l V-\I]
butterscotch=[b a t @ r s k Q t-\S,b a t @ s k Q t-\S]
buy=[b a-\i]
buying=[b a-\i j I N]
by=[b V-\I,b a-\i,b @-\i,b a-\I]
by;a=[b V-\I]
ca=[k { ?]
cab=[k {: b,k {,k { b]
cab;broke=[k {: b r\ @-\u k]
cafeteria=[k { f @ t I-\@ r\ I j @]
calf=[k A: f]
call=[k O: l,k O:,k O l]
.
.
.
Footnote 4: All CFE errors include a) utterance and sentence marker location errors, b) orthographic and phonetic alignment errors, and c) phonetic symbol errors.
Note that the phones are separated by spaces and that commas are used to
separate the different phonetic representations of an orthographic word. An
orthographic word’s most frequent phonetic sequence will occur first in the list, while
the least frequent one will be written last.
The extracted lexicon can be checked using CheckLexicon.pl (Section 2.7).
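The entry format just described (word, equals sign, square brackets, comma-separated sequences of space-separated phones) can be read back with a few lines of Perl. The following is a minimal sketch; the sub name parse_var_entry is hypothetical, and it assumes each entry occupies one line of the lexicon file and that the orthographic word itself contains no "=" character.

```perl
use strict;
use warnings;

# Split one VAR lexicon line of the form  word=[seq1,seq2,...]  into the
# orthographic word and a list of phone sequences (each a list of phones).
# Returns an empty list if the line does not have that shape.
sub parse_var_entry {
    my ($line) = @_;
    my ($word, $body) = $line =~ /^([^=]+)=\[(.*)\]\s*$/
        or return;
    # Commas separate the variants; spaces separate the phones.
    my @sequences = map { [ split ' ', $_ ] } split /,/, $body;
    return ($word, \@sequences);
}

my ($word, $seqs) = parse_var_entry('call=[k O: l,k O:,k O l]');
print "$word has ", scalar(@$seqs), " variants; first: @{ $seqs->[0] }\n";
```

Because the most frequent sequence is written first, the first element of the returned list is the most common pronunciation of the word.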
4) Please enter the name of the a priori lexicon.
This file will hold the a priori probabilities of each phonetic sequence in the lexicon.
The data will be saved in the same structure used to store the VAR lexicon.
Therefore, the a priori probabilities will be written to the locations where the
phonetic sequences would normally have stood. For the section of the extracted AE
VAR lexicon above, the a priori lexicon looks as follows:
.
.
.
but=[0.00190174326465927,0.00135587251276633,0.000792393026941363,0.000158478605388273,0.000158478605388273]
but;its=[1.76087339320303e-005]
but;the=[0.000563479485824969,0.000158478605388273,0.000105652403592182,7.04349357281212e-005,5.28262017960909e-005,3.52174678640606e-005,3.52174678640606e-005,3.52174678640606e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]
but;the;*elen+=[1.76087339320303e-005]
but;the;elementary=[1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]
butterfly=[1.76087339320303e-005,1.76087339320303e-005]
butterscotch=[3.52174678640606e-005,1.76087339320303e-005]
buy=[1.76087339320303e-005]
buying=[1.76087339320303e-005]
by=[0.000281739742912485,0.000123261137524212,5.28262017960909e-005,3.52174678640606e-005]
by;a=[1.76087339320303e-005]
ca=[1.76087339320303e-005]
cab=[1.76087339320303e-005,1.76087339320303e-005,1.76087339320303e-005]
cab;broke=[1.76087339320303e-005]
cafeteria=[1.76087339320303e-005]
calf=[1.76087339320303e-005]
call=[0.000669131889417151,8.80436696601514e-005,1.76087339320303e-005]
.
.
.
Note that this file cannot be checked with CheckLexicon.pl.
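In the sample above, each a priori value is proportional to the corresponding count in the counts lexicon (for example, a count of 108 corresponds to 108 × 1.76087339320303e-005 ≈ 0.00190174326465927). The sketch below assumes that the constant of proportionality is one over the total number of counted phonetic sequences in the whole lexicon; the report does not state this explicitly, and the sub name counts_to_apriori is hypothetical.

```perl
use strict;
use warnings;

# Convert a counts lexicon into an a priori lexicon.
# Input:  { word => [ count, count, ... ] }
# Output: { word => [ probability, ... ] } in the same order.
# Assumption: probability = count / (sum of all counts in the lexicon).
sub counts_to_apriori {
    my ($counts) = @_;
    my $total = 0;
    for my $list (values %$counts) {
        $total += $_ for @$list;
    }
    return {} unless $total;      # empty lexicon: nothing to normalise
    my %apriori;
    for my $word (keys %$counts) {
        $apriori{$word} = [ map { $_ / $total } @{ $counts->{$word} } ];
    }
    return \%apriori;
}

my $demo = counts_to_apriori({ but => [108, 77, 45, 9, 9], 'but;its' => [1] });
print "but=[", join(',', @{ $demo->{but} }), "]\n";
```

With this definition the probabilities over the whole lexicon sum to one, which is one plausible reading of "a priori probability of each phonetic sequence in the lexicon".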
5) Please enter the name of the counts lexicon.
This file will hold the phonetic sequence counts - the number of times each phonetic
sequence occurs in the TextGrid files. The data will be saved in the same structure
used to store the VAR lexicon. Therefore, the counts will be written to the locations
where the phonetic sequences would normally have stood. For the section of the
extracted AE VAR lexicon above, the counts lexicon looks as follows:
.
.
.
but=[108,77,45,9,9]
but;its=[1]
but;the=[32,9,6,4,3,2,2,2,1,1,1,1,1,1,1,1]
but;the;*elen+=[1]
but;the;elementary=[1,1,1,1,1,1,1,1,1,1]
butterfly=[1,1]
butterscotch=[2,1]
buy=[1]
buying=[1]
by=[16,7,3,2]
by;a=[1]
ca=[1]
cab=[1,1,1]
cab;broke=[1]
cafeteria=[1]
calf=[1]
call=[38,5,1]
.
.
.
Note that this file cannot be checked with CheckLexicon.pl.
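The counts lexicon and the ordering of the VAR lexicon (most frequent sequence first) can both be derived from a simple tally of (word, phonetic sequence) pairs. The sketch below illustrates this; the sub names build_counts and format_entry are hypothetical, and breaking ties alphabetically is an assumption made only to keep the output deterministic.

```perl
use strict;
use warnings;

# Tally (word, phone sequence) pairs and order each word's sequences from
# most to least frequent. Each pair is an array reference [ word, sequence ].
# Returns { word => [ [ sequence, count ], ... ] }.
sub build_counts {
    my (@pairs) = @_;
    my %count;
    $count{ $_->[0] }{ $_->[1] }++ for @pairs;
    my %lexicon;
    for my $word (keys %count) {
        my $seqs = $count{$word};
        my @ordered = sort { $seqs->{$b} <=> $seqs->{$a} || $a cmp $b }
                      keys %$seqs;
        $lexicon{$word} = [ map { [ $_, $seqs->{$_} ] } @ordered ];
    }
    return \%lexicon;
}

# Write one entry in the word=[...] layout; $field 0 gives the VAR lexicon
# line (sequences), $field 1 gives the counts lexicon line.
sub format_entry {
    my ($word, $entries, $field) = @_;
    return "$word=[" . join(',', map { $_->[$field] } @$entries) . "]";
}

my $lex = build_counts(['but', 'b V t'], ['but', 'b a t'], ['but', 'b V t']);
print format_entry('but', $lex->{but}, 0), "\n";
print format_entry('but', $lex->{but}, 1), "\n";
```

Keeping the sequence and its count together in one structure guarantees that the VAR lexicon and the counts lexicon list their entries in the same order.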
6) Please enter the name of the file that will hold the location info.
This file will hold the information that will be needed to locate the TextGrid files that
are associated with each phonetic sequence in the VAR lexicon. For the section of
the extracted AE VAR lexicon above, the location information is shown below:
.
.
.
but bVt K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE003LM034.TextGrid AE005LM024.TextGrid AE006LF027.TextGrid
AE007LF034.TextGrid AE012CM040.TextGrid AE016CF024.TextGrid
AE017CF033.TextGrid AE018CF040.TextGrid AE024LM032.TextGrid
AE026CF022.TextGrid AE028CF032.TextGrid AE028LF032.TextGrid
AE035LM033.TextGrid AE042CM029.TextGrid AE046LF024.TextGrid
AE048LF021.TextGrid AE055CM021.TextGrid AE066LF039.TextGrid
AE067LF034.TextGrid AE068LF019.TextGrid AE070LF035.TextGrid
AE075CM026.TextGrid AE081LM025.TextGrid AE087LF040.TextGrid
AE088LF035.TextGrid AE089LF032.TextGrid AE094CM023.TextGrid
AE106LF025.TextGrid AE110CF039.TextGrid AE115CM021.TextGrid
AE120CF029.TextGrid AE126LF035.TextGrid AE128LF040.TextGrid
AE129LF039.TextGrid AE130LF027.TextGrid AE131CM030.TextGrid
AE132CM034.TextGrid AE133LM031.TextGrid AE134CM022.TextGrid
AE135CM025.TextGrid AE137CF024.TextGrid AE147LM021.TextGrid
AE154CM028.TextGrid AE158CF037.TextGrid AE161LM028.TextGrid
AE163LM022.TextGrid AE167LF040.TextGrid AE168LF039.TextGrid
AE169LF023.TextGrid AE173CM037.TextGrid AE174CM030.TextGrid
AE180CF028.TextGrid AE183LM040.TextGrid AE185LM030.TextGrid
AE187LF039.TextGrid AE189LF027.TextGrid AE190LF023.TextGrid
AE191CM029.TextGrid AE192CM038.TextGrid AE194CM037.TextGrid
AE196CF030.TextGrid AE200CF026.TextGrid AE203LF024.TextGrid
AE205LF027.TextGrid AE206LF039.TextGrid AE207LF023.TextGrid
AE212LF039.TextGrid AE214LM021.TextGrid AE216LF032.TextGrid
AE217LM027.TextGrid AE225LM035.TextGrid AE232LM033.TextGrid
AE251CF025.TextGrid AE253CF030.TextGrid AE255LM025.TextGrid
AE257LF025.TextGrid AE258LM031.TextGrid AE261CM039.TextGrid
AE263CF022.TextGrid AE265LF031.TextGrid AE266LM032.TextGrid
AE267LM025.TextGrid AE268LF029.TextGrid AE269CF028.TextGrid
AE270CF027.TextGrid AE277CM039.TextGrid AE283LM032.TextGrid
AE289LF023.TextGrid AE291CF027.TextGrid AE295LM023.TextGrid
AE300LF037.TextGrid AE312CF039.TextGrid AE314CM038.TextGrid
AE325LF037.TextGrid AE344LF031.TextGrid AE352LF034.TextGrid
AE359CF027.TextGrid AE365LF034.TextGrid AE373LF027.TextGrid
AE374LF032.TextGrid AE375CF029.TextGrid AE379LF022.TextGrid
AE381LF030.TextGrid AE384CF038.TextGrid AE385CF040.TextGrid
AE387LF026.TextGrid AE390LM022.TextGrid AE390LM030.TextGrid
but bat K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE004LM034.TextGrid AE015LM023.TextGrid AE018CF027.TextGrid
AE018CF035.TextGrid AE019CF037.TextGrid AE021LM028.TextGrid
AE023LM032.TextGrid AE032LM025.TextGrid AE036LF027.TextGrid
AE039LF026.TextGrid AE048LF021.TextGrid AE048LF025.TextGrid
AE054CM031.TextGrid AE069LF038.TextGrid AE077CF035.TextGrid
AE081LM035.TextGrid AE082LM035.TextGrid AE083LM035.TextGrid
AE091CM036.TextGrid AE092CM023.TextGrid AE093CM034.TextGrid
AE096CF030.TextGrid AE097LF033.TextGrid AE098LF032.TextGrid
AE100LF021.TextGrid AE103LM027.TextGrid AE104LM034.TextGrid
AE105LM036.TextGrid AE108LF036.TextGrid AE109LF038.TextGrid
AE111CM036.TextGrid AE112CM035.TextGrid AE113CM038.TextGrid
AE115CM020.TextGrid AE116CF038.TextGrid AE119CF040.TextGrid
AE121LM035.TextGrid AE122LM034.TextGrid AE124LM024.TextGrid
AE125LM028.TextGrid AE127LF028.TextGrid AE133LM027.TextGrid
AE138LF039.TextGrid AE139CF025.TextGrid AE141LM034.TextGrid
AE144LM032.TextGrid AE145LM035.TextGrid AE150LF031.TextGrid
AE152CM026.TextGrid AE155LM038.TextGrid AE157CF040.TextGrid
AE160LF021.TextGrid AE160LF025.TextGrid AE172CM038.TextGrid
AE178CF031.TextGrid AE181LM027.TextGrid AE184LM039.TextGrid
AE186LF031.TextGrid AE198CF030.TextGrid AE199LF040.TextGrid
AE204LM033.TextGrid AE219CM023.TextGrid AE226LM022.TextGrid
AE230CF039.TextGrid AE241LF038.TextGrid AE252CF038.TextGrid
AE264CF021.TextGrid AE332LM032.TextGrid AE344LF019.TextGrid
AE344LF022.TextGrid AE351LF021.TextGrid AE351LF032.TextGrid
AE372LM030.TextGrid AE372LM031.TextGrid AE377LF027.TextGrid
AE386CM029.TextGrid AE393LM030.TextGrid
but bV K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE029LF035.TextGrid AE031CF028.TextGrid AE033CM027.TextGrid
AE038CF038.TextGrid AE041LM022.TextGrid AE043LM022.TextGrid
AE044LM039.TextGrid AE049LF033.TextGrid AE050LF032.TextGrid
AE051CM038.TextGrid AE052CM034.TextGrid AE053CM033.TextGrid
AE058LF029.TextGrid AE060CF021.TextGrid AE063LM029.TextGrid
AE065LM029.TextGrid AE068LF021.TextGrid AE101LM035.TextGrid
AE162LM022.TextGrid AE179CF029.TextGrid AE208LF021.TextGrid
AE211CM031.TextGrid AE218CF040.TextGrid AE221LM036.TextGrid
AE223LF025.TextGrid AE227CM034.TextGrid AE231LF036.TextGrid
AE284LM025.TextGrid AE288CM040.TextGrid AE290LM037.TextGrid
AE311CM035.TextGrid AE314CM039.TextGrid AE314LF023.TextGrid
AE315CM023.TextGrid AE320CM022.TextGrid AE323LF028.TextGrid
AE324CF034.TextGrid AE331CM027.TextGrid AE343CM037.TextGrid
AE346LF034.TextGrid AE348LF036.TextGrid AE353LF036.TextGrid
AE361LF030.TextGrid AE366LF034.TextGrid AE376CM035.TextGrid
but b@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE020CF025.TextGrid AE047LF025.TextGrid AE056CF021.TextGrid
AE071CM029.TextGrid AE153CF023.TextGrid AE210CF031.TextGrid
AE213CM026.TextGrid AE318LM028.TextGrid AE330CF021.TextGrid
but b@t K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE002LM039.TextGrid AE045LM039.TextGrid AE170LF021.TextGrid
AE195CM037.TextGrid AE215CF034.TextGrid AE216LF022.TextGrid
AE259LM021.TextGrid AE345LM032.TextGrid AE378LF036.TextGrid
but;its bV4Its K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE293LM031.TextGrid
but;the bVt@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE008CF028.TextGrid AE013LM024.TextGrid AE024LM033.TextGrid
AE056CF024.TextGrid AE059CF036.TextGrid AE072CM023.TextGrid
AE080CF034.TextGrid AE099CF029.TextGrid AE148LF035.TextGrid
AE166LF033.TextGrid AE175CM028.TextGrid AE176CF036.TextGrid
AE188LF029.TextGrid AE222LF036.TextGrid AE268LF040.TextGrid
AE275CF023.TextGrid AE276CM024.TextGrid AE281CF020.TextGrid
AE281CF021.TextGrid AE285LF023.TextGrid AE292LM037.TextGrid
AE296LM035.TextGrid AE297LF025.TextGrid AE329LF033.TextGrid
AE338CM036.TextGrid AE347LM027.TextGrid AE350LF023.TextGrid
AE355LF028.TextGrid AE362CF036.TextGrid AE371LF024.TextGrid
AE380LF027.TextGrid AE389LM025.TextGrid
but;the bVd@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE062LM022.TextGrid AE073CM022.TextGrid AE076CF022.TextGrid
AE151CM038.TextGrid AE220LM021.TextGrid AE271LM026.TextGrid
AE289LF038.TextGrid AE299LF034.TextGrid AE340LF028.TextGrid
but;the bVD@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE029CF035.TextGrid AE074CM034.TextGrid AE182LM034.TextGrid
AE205LF033.TextGrid AE298LM026.TextGrid AE391LF034.TextGrid
but;the b@t@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE156CF022.TextGrid AE273LF029.TextGrid AE287LM040.TextGrid
AE388LM029.TextGrid
but;the bad@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE102LM025.TextGrid AE107LF022.TextGrid AE123LM040.TextGrid
but;the b@d@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE164LM033.TextGrid AE333LF036.TextGrid
but;the bVDI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE034LM026.TextGrid AE294LF025.TextGrid
but;the bVtI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE317LF038.TextGrid AE392LF032.TextGrid
but;the bat@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE145CM035.TextGrid
but;the bVt@? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE354LF040.TextGrid
but;the bVdE: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE142LF029.TextGrid
but;the bVt3: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE313LM028.TextGrid
but;the bVt? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE009LF024.TextGrid
but;the bV4@ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE159CF029.TextGrid
but;the bVdI K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE293LM029.TextGrid
but;the bVt K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE171CM029.TextGrid
but;the;*elen+ bad9lIn K:\Phon_corrected_FLE\AE_phon_corrected_final\merge
% AE165LM033.TextGrid
but;the;elementary bVd@{l@mEntR\I
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE319LF030.TextGrid
but;the;elementary bVD{l@mEntr\i:
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE064LM035.TextGrid
but;the;elementary bVD@l@mEntr\I
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE342CM032.TextGrid
but;the;elementary bVDEl@mEntr\i
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE379LF021.TextGrid
but;the;elementary bVDEl@mEnt@r\i:
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE128LF038.TextGrid
but;the;elementary bVd@l@mEntr\I
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE394LM024.TextGrid
but;the;elementary bDE:l@mEntr\y
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE193CM035.TextGrid
but;the;elementary baDEl@mEntri:
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE143LM031.TextGrid
but;the;elementary baDEl@mEntr\i
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE001LM023.TextGrid
but;the;elementary bVdE:l@m@ntr\i:
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE356LM023.TextGrid
butterfly bVtfV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE003LM026.TextGrid
butterfly bVt@flV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE153CF029.TextGrid
butterscotch bat@rskQt-\S
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE028CF021.TextGrid AE028LF021.TextGrid
butterscotch bat@skQt-\S
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE028LF020.TextGrid
buy ba-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE215CF029.TextGrid
buying ba-\ijIN K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE206LF031.TextGrid
by bV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE007LF037.TextGrid AE045LM026.TextGrid AE052CM030.TextGrid
AE142LF037.TextGrid AE158CF032.TextGrid AE166LF040.TextGrid
AE208LF033.TextGrid AE220LM023.TextGrid AE263CF029.TextGrid
AE270CF024.TextGrid AE276CM033.TextGrid AE285LF022.TextGrid
AE312CF037.TextGrid AE343CM030.TextGrid AE350LF040.TextGrid
AE359CF023.TextGrid
by ba-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE001LM039.TextGrid AE116CF035.TextGrid AE150LF039.TextGrid
AE165LM030.TextGrid AE186LF038.TextGrid AE365LF021.TextGrid
by b@-\i K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE198CF014.TextGrid AE198CF040.TextGrid AE200CF015.TextGrid
by ba-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE032LM038.TextGrid AE092CM027.TextGrid
by;a bV-\I K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE361LF021.TextGrid
ca k{? K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE136CF032.TextGrid
cab k{:b K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE185LM021.TextGrid
cab k{ K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE153CF037.TextGrid
cab k{b K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE141LM038.TextGrid
cab;broke k{:br\@-\uk K:\Phon_corrected_FLE\AE_phon_corrected_final\merge
% AE168LF026.TextGrid
cafeteria k{f@tI-\@r\Ij@
K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE104LM022.TextGrid
calf kA:f K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE185LM040.TextGrid
call kO:l K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE013LM036.TextGrid AE019CF034.TextGrid AE029CF024.TextGrid
AE029LF024.TextGrid AE031CF029.TextGrid AE036LF022.TextGrid
AE045LM034.TextGrid AE047LF028.TextGrid AE060CF034.TextGrid
AE075CM030.TextGrid AE077CF026.TextGrid AE079CF039.TextGrid
AE091CM038.TextGrid AE103LM026.TextGrid AE106LF032.TextGrid
AE130LF028.TextGrid AE135CM037.TextGrid AE144LM039.TextGrid
AE147LM040.TextGrid AE154CM022.TextGrid AE156CF028.TextGrid
AE168LF030.TextGrid AE169LF022.TextGrid AE171CM035.TextGrid
AE176CF031.TextGrid AE182LM038.TextGrid AE204LM035.TextGrid
AE223LF029.TextGrid AE230CF031.TextGrid AE275CF025.TextGrid
AE276CM025.TextGrid AE281CF036.TextGrid AE283LM026.TextGrid
AE329LF026.TextGrid AE333LF025.TextGrid AE359CF025.TextGrid
AE361LF025.TextGrid AE380LF040.TextGrid
call kO: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE148LF031.TextGrid AE151CM023.TextGrid AE181LM035.TextGrid
AE287LM028.TextGrid AE371LF030.TextGrid
call kOl K:\Phon_corrected_FLE\AE_phon_corrected_final\merge %
AE024LM027.TextGrid
.
.
.
Each entry/line has the following format:
ort_word phon_sequence directory % TextGrids
Given a lexicon entry, this format makes it easy for a script to locate the directories
and TextGrid files associated with that entry. The percentage sign separates the directory
from the list of TextGrid files, which simplifies scripting.
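A location entry in this format can be taken apart as sketched below. The sub name parse_location_entry is hypothetical. The sketch assumes that each entry occupies a single line in the actual file (the wrapping in the listing above being a matter of page layout), that the orthographic word and the phonetic sequence contain no white space, and that the directory name does not itself contain a percentage sign surrounded by spaces.

```perl
use strict;
use warnings;

# Parse one line of the location file:
#   ort_word phon_sequence directory % TextGrids
# Returns a hash reference, or nothing if the line is malformed.
sub parse_location_entry {
    my ($line) = @_;
    # The percentage sign splits the entry into its two halves.
    my ($left, $right) = split /\s+%\s+/, $line, 2;
    return unless defined $right;
    # A limit of 3 keeps any spaces inside the directory name intact.
    my ($word, $phones, $directory) = split ' ', $left, 3;
    return unless defined $directory;
    return {
        word      => $word,
        phones    => $phones,
        directory => $directory,
        textgrids => [ split ' ', $right ],
    };
}

my $entry = parse_location_entry(
    'call kO: K:\Phon_corrected_FLE\AE_phon_corrected_final\merge % '
    . 'AE148LF031.TextGrid AE151CM023.TextGrid');
print "$entry->{word} ($entry->{phones}): ",
      scalar(@{ $entry->{textgrids} }), " TextGrid files\n";
```

The number of TextGrid files in an entry equals the count stored for the same phonetic sequence in the counts lexicon, which gives a simple cross-check between the two files.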
7) Select which words should be added to the lexicon:
1. All the orthographic words.
2. Only those words standing between dollar markers.
3. Only those words that are not standing between dollar markers.
Here the user can specify which words should be included in the lexicon.
2.3.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.3.5 Generated Output Files
If one or more errors are found in the TextGrids, the following file will be created in the
original startup directory before program execution is stopped:
• DirectoriesWithErrors.txt
Holds the names of all the directories in which TextGrid files with errors were found.
Note that this file will only be created if working in one or more directories.
In addition to this, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory that will hold the TextGrid files
containing the errors. Their alaw files (lying in the working directory or alaw directory) will be
copied to the FWE subdirectory as well. A FilesWithErrors.txt file will also be created in the
FWE folder and will contain an error message for each affected TextGrid file. To work
through these error files, use WorkThroughFWEErrorFiles.pl (Section 2.35).
If the VAR lexicon contains entries, the following files (with the filenames specified on the
command line by the user) will be created in the original startup directory:
• The file holding the VAR lexicon.
• The file holding the a priori lexicon.
• The file holding the counts lexicon.
• The file holding the location information.
2.4 CheckAndReplaceOrtNames.pl
2.4.1 Overview
Checks the name of the orthographic transcription in the TextGrid files. If the name is not
the desired one, it is replaced with the correct name, namely “Orthographic”. TextGrid files
are allowed to contain only orthographic transcriptions, or orthographic and phonetic
transcriptions.
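As an illustration of the renaming step, the sketch below operates on the text of a TextGrid file. It rests on assumptions that the report does not confirm: that the files use Praat's long text format, in which each tier carries a line of the form name = "...", and that the orthographic transcription is the first tier in the file. The sub name fix_ort_tier_name is hypothetical.

```perl
use strict;
use warnings;

# Make sure the first tier of a TextGrid (given as one string) is named
# "Orthographic". Returns the possibly modified text and a flag that is 1
# if the name had to be changed and 0 otherwise.
sub fix_ort_tier_name {
    my ($text, $wanted) = @_;
    $wanted = 'Orthographic' unless defined $wanted;
    my $changed = 0;
    # No /g flag: only the first name line (taken to be tier 1) is touched.
    $text =~ s{^([ \t]*name[ \t]*=[ \t]*")([^"]*)"}{
        $changed = 1 if $2 ne $wanted;
        $1 . $wanted . '"';
    }me;
    return ($text, $changed);
}
```

The returned flag can be used to decide whether the directory should be listed in DirectoriesWithUpdates.txt.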
2.4.2 Command Line Options
1) CheckAndReplaceOrtNames.pl
The orthographic transcription names are checked and replaced (if necessary) in
the TextGrid files that can be found in the directories which are specified in the list
file DirectoryList.lst.
2) CheckAndReplaceOrtNames.pl -f
Will check and replace (if necessary) the specified TextGrid file's orthographic
transcription name.
3) CheckAndReplaceOrtNames.pl -d
Will check and replace (if necessary) the orthographic transcription names of
the specified directory's TextGrid files.
4) CheckAndReplaceOrtNames.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.4.3 Generated Output Files
See Section 2.2.4.
2.5 CheckForErrors.pl
2.5.1 Overview
TextGrid files are checked for transcription errors. The files are allowed to contain only
orthographic transcriptions, or orthographic and phonetic transcriptions.
2.5.2 Command Line Options
1) CheckForErrors.pl
Will check the TextGrid files of the directories which are specified in the list file
DirectoryList.lst for errors.
2) CheckForErrors.pl -f
Will only check the specified TextGrid file for errors.
3) CheckForErrors.pl -d
Will only check the specified directory's TextGrid files for errors.
4) CheckForErrors.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.5.3 Questions
1) Checking:
1. Nguni
2. Sesotho
3. Afrikaans
4. English
Simply select the language that’s going to be worked with.
2) Does TextGrids contain ort and phon transcriptions (y/n)
If the TextGrid files contain orthographic and phonetic transcriptions answer “y”.
Otherwise, if the TextGrid files contain only orthographic transcriptions answer “n”.
3) Check utterance and sentence marker locations (y/n)
Answer “y” to this question in order to make sure that utterance and sentence
markers are used correctly with respect to interval boundaries, and that pauses
always exist between utterances and sentences.
4) Check alignment & phonetic symbols (y/n)
Answer “y” if the alignment between the orthographic and phonetic transcriptions
must be checked (every orthographic segment must have a corresponding phonetic
segment) as well as the phonetic transcription’s phonetic symbols.
Note that this question will only be asked if the answer to Question 2 is “y”.
5) Check for identical ort and phon segments (y/n)
Answer “y” if the aligned transcriptions (ort & phon) must be checked for
orthographic words and phonetic sequences that are identical. These instances will
be treated as if they are actual errors, which of course is not necessarily the case.
Note that this question will only be asked if the answer to Question 4 is “y”.
6) Phon tier contains
1. XSAMPA
2. Praat
Select the phonetic transcription format.
Note that this question will only be asked if the answer to Question 4 is “y”.
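The checks behind Questions 4 and 5 can be illustrated with a simplified sketch. It is not the logic of CheckForErrors.pl itself: alignment is reduced here to comparing the number of orthographic and phonetic segments (the real script presumably compares interval boundaries), the phones within a phonetic segment are assumed to be separated by spaces, and the phone set is passed in as a hash instead of being read from PhoneSets.lst. The sub name check_segments and the wording of the messages are hypothetical.

```perl
use strict;
use warnings;

# Compare aligned orthographic and phonetic segment labels.
# $ort and $phon are array references with one label per segment;
# $phone_set is a hash reference whose keys are the valid phones.
# Returns a list of error messages (empty if nothing was found).
sub check_segments {
    my ($ort, $phon, $phone_set) = @_;
    my @errors;
    # Every orthographic segment needs a corresponding phonetic segment.
    if (@$ort != @$phon) {
        push @errors, sprintf 'Alignment: %d orthographic vs %d phonetic segments',
            scalar @$ort, scalar @$phon;
        return @errors;
    }
    for my $i (0 .. $#$ort) {
        my ($word, $seq) = ($ort->[$i], $phon->[$i]);
        # Identical labels are flagged, although they are not always wrong.
        push @errors, "Segment $i: identical ort and phon label '$word'"
            if length $word && $word eq $seq;
        for my $phone (split ' ', $seq) {
            push @errors, "Segment $i: unknown phone '$phone'"
                unless $phone_set->{$phone};
        }
    }
    return @errors;
}

my %set = map { $_ => 1 } qw(b V t a);
print "$_\n" for check_segments(['but', 'by'], ['b V t', 'by'], \%set);
```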
2.5.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly if
the answer to Question 2.5.3.4 is “y”:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.5.5 Generated Output Files
If one or more errors are found in the TextGrids, the following file will be created in the
original startup directory before program execution is stopped:
• DirectoriesWithErrors.txt
Holds the names of all the directories in which TextGrid files with errors were found.
Note that this file will only be created if working in one or more directories.
In addition to this, if directories are found to contain errors, a FilesWithErrors (FWE)
subdirectory will be created in each affected directory that will hold the TextGrid files
containing the errors. Their alaw files (lying in the working directory or alaw directory) will be
copied to the FWE subdirectory as well. A FilesWithErrors.txt file will also be created in the
FWE folder and will contain an error message for each affected TextGrid file. To work
through these error files, use WorkThroughFWEErrorFiles.pl (Section 2.35).
2.6 CheckForInvalidIntervals.pl
2.6.1 Overview
Script checks intervals occurring at the beginning and the end of the transcriptions
(therefore the first interval and last interval) to see if they are invalid. TextGrids may
contain either orthographic transcriptions, or orthographic and phonetic transcriptions.
2.6.2 Command Line Options
1) CheckForInvalidIntervals.pl
Checks intervals occurring at the beginning and the end of the transcriptions in the
TextGrid files that can be found in the directories which are specified in the list file
DirectoryList.lst.
2) CheckForInvalidIntervals.pl -f
Checks intervals occurring at the beginning and the end of the transcriptions in the
specified TextGrid file.
3) CheckForInvalidIntervals.pl -d
Checks intervals occurring at the beginning and the end of the transcriptions in the
specified directory's TextGrid files.
4) CheckForInvalidIntervals.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.6.3 Generated Output Files
See Section 2.5.5 for the output files that will be generated if errors are located during
program execution.
If directories are found containing TextGrid files with possibly invalid intervals (an
interval flagged by the script is not necessarily invalid), the following files
will be created:
• In the startup directory: DirectoriesWithPossibleIntervalsToRemove.txt
Will hold the directory names that contain TextGrid files with possible intervals to
remove.
• In each affected TextGrid directory: FilesWithPossibleIntervalsToRemove.txt
Will hold the names of the TextGrid files in this directory that contain possible
intervals to remove.
2.7 CheckLexicon.pl
2.7.1 Overview
This script checks a specified NOVAR[5] or VAR[6] lexicon for errors. The lexicon's phones
must be in XSAMPA format. No command line arguments are necessary.
2.7.2 Questions
1) Please enter the name of the file containing the lexicon.
Here you must enter the name of the NOVAR or VAR lexicon that must be checked.
2.7.3 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.7.4 Generated Output Files
If errors were found in the lexicon, one or more of the following files will be created in the
startup directory depending on the type of errors that were found:
[5] A NOVAR lexicon contains only one phonetic sequence for each orthographic word.
[6] A VAR lexicon usually contains various phonetic sequences for each orthographic word. The
phonetic sequences of an orthographic word are separated with commas.
• GeneralLexiconErrors.txt
Will hold the general types of errors (this does not include phonetic symbol errors)
that were encountered on each line of the lexicon. An example of such a file is
shown below:
Line 6: White space encountered at the end of the phone sequence.
Line 17: "]" should be the last character on rhs of "=" sign.
Line 22: "[" should be the first character after the "=" sign.
Line 23: "=" should not occur at the beginning of the line.
Line 41: White space encountered at the beginning of the phone sequence.
• PhoneLexiconErrors.txt
Will hold the invalid phonetic symbols that were encountered on each line of the
lexicon. An example of such a file is shown below:
Line 7: Unknown phone(s): basdf (6)
Line 14: Unknown phone(s): t-\S@l (3) , Sl@ (14) , t-\Sl (25)
Line 19: Unknown phone(s): b{ (1)
Line 20: Unknown phone(s): @fO: (10)
Note that the phone’s position will always be given between parentheses.
• UniqueUnknownPhonesInLexicon.txt
Will hold the unique list of invalid phonetic symbols that were encountered in the
lexicon.
• DuplicatedWordsInLexicon_CSS.txt
Will hold those orthographic words that were found to occur more than once during
a case sensitive search (CSS) of the lexicon.
If no errors were found in the lexicon, the following file could possibly be created in the
startup directory:
• DuplicatedWordsInLexicon_CIS.txt
Will hold those orthographic words that were found to occur more than once during
a case insensitive search (CIS) of the lexicon. Possible errors could be found this
way. The user should therefore always inspect this file before starting to use the
lexicon.
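The general checks and the duplicate-word searches described above can be illustrated with a
short sketch. It is written in Python purely for illustration and is not the code of
CheckLexicon.pl. It assumes that each lexicon line has the form word=[phone sequence], which
is inferred from the example error messages above, and the function names are hypothetical.

```python
from collections import Counter


def general_line_errors(line):
    """Return the general (non-phonetic) errors found on one lexicon line.

    Assumed line format (inferred from the example messages): word=[phones]
    """
    errors = []
    line = line.rstrip("\n")
    if line.startswith("="):
        errors.append('"=" should not occur at the beginning of the line.')
    if "=" not in line:
        return errors
    rhs = line.split("=", 1)[1]
    if not rhs.startswith("["):
        errors.append('"[" should be the first character after the "=" sign.')
    if not rhs.endswith("]"):
        errors.append('"]" should be the last character on rhs of "=" sign.')
    # Strip whichever brackets are present before checking the phone sequence.
    phones = rhs[1:] if rhs.startswith("[") else rhs
    phones = phones[:-1] if phones.endswith("]") else phones
    if phones[:1].isspace():
        errors.append("White space encountered at the beginning of the phone sequence.")
    if phones[-1:].isspace():
        errors.append("White space encountered at the end of the phone sequence.")
    return errors


def duplicated_words(lines, case_sensitive=True):
    """Return the orthographic words that occur more than once (CSS or CIS)."""
    words = [ln.split("=", 1)[0] for ln in lines if "=" in ln]
    if not case_sensitive:
        words = [w.lower() for w in words]
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n > 1)
```

Calling duplicated_words with case_sensitive=False corresponds to the case insensitive search:
entries that differ only in capitalisation are reported there but not in the case sensitive
search.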
2.8 CheckNumberOfALawsAndTextGrids.pl
2.8.1 Overview
Call-subdirectories are checked to see whether they contain the same number of alaw and
orthographic TextGrid files.
2.8.2 Command Line Options
1) CheckNumberOfALawsAndTextGrids.pl
Checks to see whether the number of alaw files and the number of orthographic
TextGrid files that can be found in the directories (each containing call-
subdirectories with these data files) which are specified in the list file
DirectoryList.lst, are the same.
2) CheckNumberOfALawsAndTextGrids.pl -d
Checks to see whether the number of alaw files and the number of orthographic
TextGrid files that can be found in the specified directory (containing call-
subdirectories with these data files) are the same.
3) CheckNumberOfALawsAndTextGrids.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.8.3 Generated Output Files
If one or more errors (with respect to number of alaw and TextGrid files in call-
subdirectories) are found, the following files will be created:
• In the original startup directory: DirectoriesWithErrors.txt
Holds the names of all the base directories containing one or more call-
subdirectories in which the number of alaw and TextGrid files are not the same.
• In affected base directories (holds call-subdirectories): CallFoldersWithProblems.txt
Holds the list of call folders in which alaw/TextGrid problems have been
encountered.
• In the affected call folders - one or both of the following files:
MissingALawFiles.txt
Holds the names of the missing alaw files.
MissingTextGridFiles.txt
Holds the names of the missing TextGrid files.
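The count comparison can be sketched as follows. This is a Python illustration, not the
script's own code. The file extensions .alaw and .TextGrid and the function name are
assumptions made for the example.

```python
import os


def call_folders_with_problems(base_dir):
    """Return the call-subdirectories of base_dir whose alaw and TextGrid counts differ."""
    problems = []
    for name in sorted(os.listdir(base_dir)):
        folder = os.path.join(base_dir, name)
        if not os.path.isdir(folder):
            continue  # only call-subdirectories are inspected
        files = os.listdir(folder)
        n_alaw = sum(1 for f in files if f.endswith(".alaw"))
        n_textgrid = sum(1 for f in files if f.endswith(".TextGrid"))
        if n_alaw != n_textgrid:
            problems.append((name, n_alaw, n_textgrid))
    return problems
```

Each returned tuple names a call folder that would be listed in CallFoldersWithProblems.txt,
together with the two counts that disagree.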
2.9 Cleanup.pl
2.9.1 Overview
The transcriptions are forced to comply with the specifications with regards to space
characters, markers, spelled letters, empty intervals, and the use of */+/~/=. It can work
with TextGrid files containing only orthographic transcriptions, or orthographic and
phonetic transcriptions.
2.9.2 Command Line Options
1) Cleanup.pl
Will force the TextGrid files that can be found in the directories which are specified
in the list file DirectoryList.lst to comply with specifications.
2) Cleanup.pl -f
Will only force the specified TextGrid file to comply with specifications.
3) Cleanup.pl -d
Will only force the specified directory's TextGrid files to comply with specifications.
4) Cleanup.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.9.3 Questions
1) Remove "=" character from orthographic transcription (y/n)
Answer “y” to remove “=” characters from Afrikaans orthographic transcriptions.
Note that these characters are needed during the generation of the deterministic
phonetic transcriptions. Therefore, use this option wisely!
2.9.4 Generated Output Files
See Section 2.2.4.
2.10 CompareAlawAndTextGridFilenames.pl
2.10.1 Overview
Compares alaw and TextGrid filenames in order to make sure that each alaw file has a
TextGrid companion, and that each TextGrid file has an alaw companion. Note that a base
directory must be specified in a list file or on the command line. The base directory must
contain the alaw and TextGrid directories. The list file will be allowed to contain only one
entry.
2.10.2 Command Line Options
1) CompareAlawAndTextGridFilenames.pl
Base directory will be extracted from DirectoryList.lst.
2) CompareAlawAndTextGridFilenames.pl -d
Base directory will be extracted from command line.
3) CompareAlawAndTextGridFilenames.pl -l
Base directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.10.3 Generated Output Files
The following files could be created in the startup directory:
• TextGridsWithoutALawCompanions.txt
Will be created if TextGrids (in TextGrid directory) are found that don’t have alaw (in
alaw directory) companions.
• ALawsWithoutTextGridCompanions.txt
Will be created if alaws (in alaw directory) are found that don’t have TextGrid (in
TextGrid directory) companions.
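The comparison amounts to two set differences over the file base names, as the sketch below
shows. It is a Python illustration rather than the script's own code. The extensions .alaw and
.TextGrid and the function name are assumptions.

```python
import os


def companions_missing(alaw_dir, textgrid_dir):
    """Return (TextGrids without alaw companions, alaws without TextGrid companions)."""
    def base_names(directory, extension):
        return {os.path.splitext(f)[0]
                for f in os.listdir(directory) if f.endswith(extension)}

    alaws = base_names(alaw_dir, ".alaw")
    textgrids = base_names(textgrid_dir, ".TextGrid")
    return sorted(textgrids - alaws), sorted(alaws - textgrids)
```

The first list corresponds to the contents of TextGridsWithoutALawCompanions.txt and the
second to ALawsWithoutTextGridCompanions.txt.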
2.11 Converter.pl
2.11.1 Overview
Converts phonetic transcriptions in TextGrid files from one format to another, e.g.
XSAMPA -> Praat or Praat -> XSAMPA. Note that the TextGrid files must contain
orthographic and phonetic transcriptions.
2.11.2 Command Line Options
1) Converter.pl
Converts the phonetic transcription format of the TextGrid files that can be found in
the directories which are specified in the list file DirectoryList.lst.
2) Converter.pl -f
Will only convert the phonetic transcription format of the specified TextGrid file.
3) Converter.pl -d
Will only convert the phonetic transcription format of the specified directory's
TextGrid files.
4) Converter.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.11.3 Questions
1) For this script to work properly, the data must be free of all CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of ALL CFE (CheckForErrors.pl)
errors[7]. Otherwise, answer “n” in order to abort the program.
2) Convert
1. XSAMPA to Praat
2. Praat to XSAMPA
Simply choose the conversion process that must be performed.
2.11.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.11.5 Generated Output Files
See Section 2.2.4.
2.12 CopyAssimErrorFilesToFWE.pl
2.12.1 Overview
TextGrids with possible assimilation errors and their alaws are copied to the
FilesWithErrors (FWE) subdirectory under the TextGrid directory. Note that a base directory
must be specified in a list file or on the command line. The base directory must contain the
alaw and TextGrid directories. The list file will be allowed to contain only one entry.
[7] All CFE errors include a) utterance and sentence marker location errors, b) orthographic
and phonetic alignment errors, and c) phonetic symbol errors.
2.12.2 Command Line Options
1) CopyAssimErrorFilesToFWE.pl
Base directory will be extracted from DirectoryList.lst.
2) CopyAssimErrorFilesToFWE.pl -d
Base directory will be extracted from command line.
3) CopyAssimErrorFilesToFWE.pl -l
Base directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.12.3 Questions
1) Please enter name of file holding assimilation information.
The name of the file holding the possible assimilation errors must be entered on the
command line in order to allow the script to determine which TextGrids (and
corresponding alaws) to copy to the FilesWithErrors subdirectory under the TextGrid
folder. To work through these error files, use WorkThroughFWEErrorFiles.pl
(Section 2.35). Also remember to copy the assimilation error file to the FWE
subdirectory and rename it to FilesWithErrors.txt so that
WorkThroughFWEErrorFiles.pl works properly. An example of a file holding the
assimilation information follows below:
AE001LM002.TextGrid:
0 Van;_Der_Linde [/sta]
0 fan@rl@nd@ [/sta]
AE001LM023.TextGrid:
(s) Helen's;stint as
(s) h{l@nst@nt {z
AE001LM023.TextGrid:
(s) but;the;elementary practical
(s) baDEl@mEntr\i pr\{ktik@l
AE001LM033.TextGrid:
one eight;two three
wan @-\itu Tri
AE001LM039.TextGrid:
prison was;surrounded by
pr\@z@n wQs@r\a-\und@d ba-\i
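To decide which files to copy, only the lines naming a TextGrid matter. The sketch below is a
Python illustration, not the script's own code, and the function name is hypothetical. It
collects the unique TextGrid filenames from a file laid out like the example above, in which
the same TextGrid may be listed more than once.

```python
def textgrids_in_assimilation_file(lines):
    """Return the unique TextGrid filenames named in an assimilation file, in order."""
    names = []
    for line in lines:
        line = line.strip()
        if line.endswith(".TextGrid:"):
            name = line[:-1]  # drop the trailing colon
            if name not in names:
                names.append(name)
    return names
```

Each returned name, together with its alaw companion, is a candidate for copying to the FWE
subdirectory.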
2.13 CopyPhoneCorrectedFiles.pl
2.13.1 Overview
Copies phonetically corrected data that's scattered over several base directories to a
single target base directory. Note that each base directory must contain alaw, attrib, lab,
and TextGrid (merge) directories.
2.13.2 Command Line Options
1) CopyPhoneCorrectedFiles.pl
Copies phonetically corrected data that can be found in the base directories which
are specified in the list file DirectoryList.lst to a single target base directory.
2) CopyPhoneCorrectedFiles.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.13.3 Questions
1) Work in merge_XSAMPA (y/n)
Answer “y” to this question if the TextGrid files are lying in merge_XSAMPA
directories. Otherwise, answer “n” if they are lying in merge directories.
2) Please enter name of directory that will hold the data.
Enter name of target base directory – it should exist. Note that the following
directories will be created in it in order to hold the data, namely:
alaw/attrib/lab/merge.
2.14 CopySNBDErrorFilesToFWE.pl
2.14.1 Overview
TextGrids with SNBD (sil/nonsil/boundary/duration) errors and their alaws are copied to the
FilesWithErrors (FWE) subdirectory under the TextGrid directory. Note that a base
directory must be specified in a list file or on the command line. The base directory must
contain the alaw and TextGrid (merge) directories. The list file will be allowed to contain
only one entry. After the TextGrid and alaw files have been copied to the FWE subdirectory, a
FilesWithErrors.txt file will be created containing information about the SNBD errors. You
can work through the TextGrid files containing the errors using
WorkThroughFWEErrorFiles.pl (Section 2.35).
2.14.2 Command Line Options
1) CopySNBDErrorFilesToFWE.pl
Base directory will be extracted from DirectoryList.lst.
2) CopySNBDErrorFilesToFWE.pl -d
Base directory will be extracted from command line.
3) CopySNBDErrorFilesToFWE.pl -l
Base directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.14.3 Questions
1) Are there any sil errors (y/n)
Answer “y” if TextGrid files with sil errors[8] exist. To move on to Question 3 answer
“n”.
2) Please enter the name of the sil file.
The name of the file holding the sil errors must be entered on the command line in
order to allow the script to determine which TextGrids – and corresponding alaws –
containing this specific type of error to copy to FilesWithErrors subdirectory under
the TextGrid folder. An example of this type of error file is shown below:
AE016CF021_006
AE018CF031_003
AE058LF008_001
AE062LM002_004
AE067LF009_010
AE088LF038_005
AE106LF017_008
AE112CM010_008
AE115CM022_002
AE116CF022_001
AE116CF040_007
AE130LF004_003
AE130LF024_001
AE130LF040_009
AE144LM003_001
AE155LM038_022
AE162LM003_004
The TextGrid base-names and interval numbers (containing the sil errors) are
separated with underscores. Edward wrote a program producing this type of error
file. Further enquiries regarding the generation of these files should therefore be
directed towards him.
Note that this question will only be asked if the answer to Question 1 is “y”.
[8] A sil error is when an interval is marked as [sil], but should rather have been marked
as, for example, [int] or [sta].
3) Are there any nonsil errors (y/n)
Answer “y” if TextGrid files with nonsil errors[9] exist. To move on to Question 5
answer “n”.
4) Please enter the name of the nonsil file.
The name of the file holding the nonsil errors must be entered on the command line
in order to allow the script to determine which TextGrids – and corresponding alaws
– containing this specific type of error to copy to FilesWithErrors subdirectory under
the TextGrid folder. The format of this file is the same as in the case of the sil errors
– see Question 2.
Note that this question will only be asked if the answer to Question 3 is “y”.
5) Are there any boundary errors (y/n)
Answer “y” if TextGrid files with boundary errors[10] exist. To move on to Question 7
answer “n”.
6) Please enter the name of the boundary file.
The name of the file holding the boundary errors must be entered on the command
line in order to allow the script to determine which TextGrids – and corresponding
alaws – containing this specific type of error to copy to FilesWithErrors subdirectory
under the TextGrid folder. An example of this type of error file is shown below:
AE006LF019_002 right
AE006LF039_003 right
AE017LF002_003 left
AE017LF003_002 right
AE017LF006_002 left
AE017LF010_002 left
EE004LF029_003 right
AE020LM001_003 right
AE020LM013_002 right
AE023LM009_002 right
AE023LM024_003 right
AE024LF007_003 right
AE024LF030_003 right
AE024LF033_002 right
AE024LF037_002 right
AE024LF038_002 right
AE024LF040_003 right
The TextGrid base-names and interval numbers (containing the boundary errors)
are separated with underscores. In addition to this, an indication will be given after
an interval number as to whether the error occurred on the left or right boundary.
Edward wrote a program producing this type of error file. Further enquiries
regarding the generation of these files should therefore be directed towards him.
[9] A nonsil error is when an interval is marked as e.g. [int] or [sta], but should rather be
marked as [sil].
[10] A boundary error is when a boundary is not placed at the correct location within the
transcription.
Note that this question will only be asked if the answer to Question 5 is “y”.
7) Are there any duration errors (y/n)
Answer “y” if TextGrid files with duration errors[11] exist. To move on, answer “n”.
8) Please enter the name of the duration file.
The name of the file holding the duration errors must be entered on the command
line in order to allow the script to determine which TextGrids – and corresponding
alaws – containing this specific type of error to copy to FilesWithErrors subdirectory
under the TextGrid folder. The format of this file is the same as in the case of the sil
errors – see Question 2.
Note that this question will only be asked if the answer to Question 7 is “y”.
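The entries of the sil, nonsil, boundary and duration error files shown above can be split
into their parts as sketched below. This is a Python illustration, not the script's own code,
and the function name is hypothetical. It relies only on the layout visible in the examples: a
TextGrid base-name, an underscore, an interval number, and for boundary errors a trailing
"left" or "right".

```python
def parse_snbd_entry(line):
    """Split one error-file entry into (TextGrid base name, interval number, side).

    side is "left" or "right" for boundary entries and None otherwise.
    """
    fields = line.split()
    # The interval number follows the last underscore of the first field.
    base_name, interval = fields[0].rsplit("_", 1)
    side = fields[1] if len(fields) > 1 else None
    return base_name, int(interval), side
```

The base name identifies the TextGrid (and alaw) to copy, while the interval number and side
are the details that an error message for that file would need.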
2.15 DeleteRedundantALTFilesLookingAtAttribs.pl
2.15.1 Overview
ALTs – Alaws (alaw folder), Lab files (lab folder – optional) and TextGrids (merge folder) –
that do not have corresponding attribute files (attrib folder) will be removed. Note that a
base directory must be specified in a list file or on the command line. The base directory
must contain the alaw, attrib, lab (optional) and merge directories. The list file will be
allowed to contain only one entry.
2.15.2 Command Line Options
1) DeleteRedundantALTFilesLookingAtAttribs.pl
Base directory will be extracted from DirectoryList.lst.
2) DeleteRedundantALTFilesLookingAtAttribs.pl -d
Base directory will be extracted from command line.
3) DeleteRedundantALTFilesLookingAtAttribs.pl -l
Base directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
[11] A duration error is when an interval is either too long or too short.
2.16 ExtractTranscriptions.pl
2.16.1 Overview
Orthographic and/or phonetic transcriptions are extracted from TextGrid files. TextGrids
are allowed to contain only orthographic transcriptions, or orthographic and phonetic
transcriptions.
2.16.2 Command Line Options
1) ExtractTranscriptions.pl
Extracts orthographic and/or phonetic transcriptions from TextGrid files that can be
found in the directories which are specified in the list file DirectoryList.lst.
2) ExtractTranscriptions.pl -f
Will only extract orthographic and/or phonetic transcriptions from the specified
TextGrid file.
3) ExtractTranscriptions.pl -d
Will only extract orthographic and/or phonetic transcriptions from the specified
directory's TextGrid files.
4) ExtractTranscriptions.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.16.3 Questions
1) Does TextGrids contain ort and phon transcriptions (y/n)
Answer “y” if the TextGrids contain both orthographic and phonetic transcriptions.
Otherwise, if the TextGrids contain only orthographic transcriptions answer “n”.
2) Extract
1. Ort
2. Phon
3. Ort & Phon
Select which transcriptions to extract.
Note that this question will only be asked if the answer to Question 1 is “y”.
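The extraction itself can be sketched as follows. This is a Python illustration, not the
script's own code, and the function name is hypothetical. It assumes a Praat long-format
TextGrid in which every interval text sits on a single line and contains no embedded double
quotes. The result holds one line per tier with every interval wrapped in double quotes, which
is the layout used in Transcriptions.txt (Section 2.16.4).

```python
import re


def extract_transcriptions(textgrid_text):
    """Return one line per tier, with every interval text wrapped in double quotes."""
    tiers = []
    for line in textgrid_text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"item \[\d+\]:", stripped):
            tiers.append([])  # a new tier starts here
            continue
        match = re.fullmatch(r'text = "(.*)"', stripped)
        if match and tiers:
            tiers[-1].append(match.group(1))
    return [" ".join('"%s"' % text for text in tier) for tier in tiers]
```

For a TextGrid with orthographic and phonetic tiers, the first returned line holds the
orthographic transcription and the second the phonetic one, so selecting "Ort", "Phon" or
"Ort & Phon" corresponds to keeping the first line, the second line, or both.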
2.16.4 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files.
If working with one or more directories, the extracted transcriptions will be stored in the
following file in each TextGrid directory:
• Transcriptions.txt
This file will hold a summary of each TextGrid’s transcriptions. Each TextGrid
directory will have such a file. As an example suppose we have a number of
TextGrids in a directory containing orthographic and phonetic transcriptions. If the
script extracts only the orthographic transcriptions, the text file will contain the
following (intervals will always be indicated with the “ character):
AE001LM001.TextGrid:
"0" "[spk]" "(u) A. 0 E. zero zero one (/u)" "[spk]" "[sta]" "[ext]"
AE001LM002.TextGrid:
"(u) [sta] Martin 0 Isak 0 Van;_Der_Linde [/sta] (/u)" "[spk]" "[sta]"
AE001LM003.TextGrid:
"(u) [sta] V. 0 A. E. N. [/sta] (/u)" "[spk]" "(u) [sta] D. 0 E. R. 0 L. 0 I. [/sta] (/u)"
AE001LM004.TextGrid:
"[int]" "[sta]"
AE001LM005.TextGrid:
"[spk]" "(u) nineteen (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"
However, if the script extracts only the phonetic transcriptions the following will be
written to the file:
AE001LM001.TextGrid:
"0" "[spk]" "(u) ?@-\i 0 i zI-\@r@-\u zI-\@r@-\u wan (/u)" "[spk]" "[sta]" "[ext]"
AE001LM002.TextGrid:
"(u) [sta] ma:rt@n 0 isak 0 fan@rl@nd@ [/sta] (/u)" "[spk]" "[sta]"
AE001LM003.TextGrid:
"(u) [sta] vi 0 @-\i i En [/sta] (/u)" "[spk]" "(u) [sta] di 0 i a:r\ 0 {l 0 a-\i [/sta] (/u)"
AE001LM004.TextGrid:
"[int]" "[sta]"
AE001LM005.TextGrid:
"[spk]" "(u) na-\intin (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"
Finally, if both orthographic and phonetic transcriptions are extracted, the
Transcriptions.txt will look as follows:
AE001LM001.TextGrid:
"0" "[spk]" "(u) A. 0 E. zero zero one (/u)" "[spk]" "[sta]" "[ext]"
"0" "[spk]" "(u) ?@-\i 0 i zI-\@r@-\u zI-\@r@-\u wan (/u)" "[spk]" "[sta]" "[ext]"
AE001LM002.TextGrid:
"(u) [sta] Martin 0 Isak 0 Van;_Der_Linde [/sta] (/u)" "[spk]" "[sta]"
"(u) [sta] ma:rt@n 0 isak 0 fan@rl@nd@ [/sta] (/u)" "[spk]" "[sta]"
AE001LM003.TextGrid:
"(u) [sta] V. 0 A. E. N. [/sta] (/u)" "[spk]" "(u) [sta] D. 0 E. R. 0 L. 0 I. [/sta] (/u)"
"(u) [sta] vi 0 @-\i i En [/sta] (/u)" "[spk]" "(u) [sta] di 0 i a:r\ 0 {l 0 a-\i [/sta] (/u)"
AE001LM004.TextGrid:
"[int]" "[sta]"
"[int]" "[sta]"
AE001LM005.TextGrid:
"[spk]" "(u) nineteen (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"
"[spk]" "(u) na-\intin (/u)" "[ext]" "[sta]" "[ext]" "[sta]" "[ext]" "[sta]"
2.17 GenerateLabFiles.pl
2.17.1 Overview
Generates lab files for AST engineers using TextGrid files containing orthographic and
phonetic transcriptions. Only the phonetic transcriptions will be listed in the lab files.
2.17.2 Command Line Options
1) GenerateLabFiles.pl
Generates lab files for the TextGrid files that can be found in the directories which
are specified in the list file DirectoryList.lst.
2) GenerateLabFiles.pl -f
Generates a lab file for the specified TextGrid file.
3) GenerateLabFiles.pl -d
Generates lab files for the TextGrid files that can be found in the specified directory.
4) GenerateLabFiles.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.17.3 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.17.4 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files while running this script.
For each TextGrid file a lab file will be generated. The lab files will be created in either the
TextGrid directory (if lab directory does not exist), or a lab directory (if it exists). As an
example of how a lab file looks, first consider the following TextGrid file:
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 5.6879999999999997
tiers?
size = 2
item []:
item [1]:
class = "IntervalTier"
name = "Orthographic"
xmin = 0
xmax = 5.6879999999999997
intervals: size = 5
intervals [1]:
xmin = 0
xmax = 0.43666928678720285
text = "0"
intervals [2]:
xmin = 0.43666928678720285
xmax = 0.7731976451954351
text = "[spk]"
intervals [3]:
xmin = 0.7731976451954351
xmax = 2.4472260395268113
text = "(s) i am going to speak english (/s)"
intervals [4]:
xmin = 2.4472260395268113
xmax = 2.944283528471245
text = "[spk]"
intervals [5]:
xmin = 2.944283528471245
xmax = 5.6879999999999997
text = "[sta]"
item [2]:
class = "IntervalTier"
name = "phon1"
xmin = 0
xmax = 5.6879999999999997
intervals: size = 5
intervals [1]:
xmin = 0
xmax = 0.43666928678720285
text = "0"
intervals [2]:
xmin = 0.43666928678720285
xmax = 0.7731976451954351
text = "[spk]"
intervals [3]:
xmin = 0.7731976451954351
xmax = 2.4472260395268113
text = "(s) V-\I {m g@-\UwIN tu: spi:k INglIS (/s)"
intervals [4]:
xmin = 2.4472260395268113
xmax = 2.944283528471245
text = "[spk]"
intervals [5]:
xmin = 2.944283528471245
xmax = 5.6879999999999997
text = "[sta]"
Its generated lab file will look as follows:
BHEAD
EHEAD
End_Time Phon1 Boundary_Type Category
0.43667 [sil] Manual sil
0.77320 [spk] Manual spk
0.85690 V-\I EquiDiv phon
0.94060 { EquiDiv phon
1.02430 m EquiDiv phon
1.10800 g EquiDiv phon
1.19170 @-\U EquiDiv phon
1.27541 w EquiDiv phon
1.35911 I EquiDiv phon
1.44281 N EquiDiv phon
1.52651 t EquiDiv phon
1.61021 u: EquiDiv phon
1.69391 s EquiDiv phon
1.77761 p EquiDiv phon
1.86132 i: EquiDiv phon
1.94502 k EquiDiv phon
2.02872 I EquiDiv phon
2.11242 N EquiDiv phon
2.19612 g EquiDiv phon
2.27982 l EquiDiv phon
2.36352 I EquiDiv phon
2.44723 S Manual phon
2.94428 [spk] Manual spk
5.68800 [sta] Manual sta
Note that the word “Manual” in the “Boundary_Type” column marks interval boundaries that
were put in by hand. The “EquiDiv” boundaries are calculated by the script and are NOT
accurate. These values are obtained by simply dividing an interval’s length by the number of
items in it. The “Category” column simply states the type of event that is taking place
during that time slice.
Note that this script should be updated in order for the lab files to show more information,
e.g. orthographic information.
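The “EquiDiv” calculation described above can be sketched as follows. This is a Python
illustration, not the script's own code, and the function name is hypothetical. It assumes
that the phonetic transcription has already been split into individual phones, a step that
needs the phone set in PhoneSets.lst and is not shown here.

```python
def equidiv_rows(start, end, phones):
    """Return (end_time, phone, boundary_type) rows for one hand-labelled interval.

    The interval [start, end] is divided equally among the phones (the list must
    not be empty); only the final boundary coincides with one placed by hand.
    """
    step = (end - start) / len(phones)
    rows = []
    for index, phone in enumerate(phones, start=1):
        last = index == len(phones)
        end_time = end if last else start + index * step
        rows.append(("%.5f" % end_time, phone, "Manual" if last else "EquiDiv"))
    return rows
```

Applied to the third interval of the example TextGrid (0.7731976451954351 to
2.4472260395268113, holding 20 phones), the step is about 0.0837 seconds, which matches the
spacing of the End_Time values in the lab file above.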
2.18 GenOrt.pl
2.18.1 Overview
Generates orthographic TextGrids from TextGrid files containing orthographic and
phonetic transcriptions. TextGrid files will only be allowed to contain orthographic and
phonetic transcriptions. WARNING: The original TextGrid files will be replaced with the
newly generated orthographic TextGrids.
2.18.2 Command Line Options
1) GenOrt.pl
Will generate orthographic TextGrids from the merged files that can be found in the
directories which are specified in the list file DirectoryList.lst.
2) GenOrt.pl -f
Will only generate an orthographic TextGrid from the specified merged file.
3) GenOrt.pl -d
Will only generate orthographic TextGrids from the specified directory's merged
files.
4) GenOrt.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.18.3 Questions
1) WARNING: This script will replace merged files with orthographic TextGrids.
Do you want to continue? (y/n)
Answer “y” if you want to go ahead with generating the orthographic TextGrids from
the merged TextGrids. Otherwise, answer “n” to abort the program.
2.18.4 Generated Output Files
See Section 2.2.4.
2.19 GetPronouncedAcronyms.pl
2.19.1 Overview
Extracts pronounced acronyms from TextGrid files. The files are allowed to contain only
orthographic transcriptions, or orthographic and phonetic transcriptions.
2.19.2 Command Line Options
1) GetPronouncedAcronyms.pl
Extracts pronounced acronyms from the TextGrid files that can be found in the
directories which are specified in the list file DirectoryList.lst.
2) GetPronouncedAcronyms.pl -f
Extracts pronounced acronyms from the specified TextGrid file.
3) GetPronouncedAcronyms.pl -d
Extracts pronounced acronyms from the specified directory's TextGrid files.
4) GetPronouncedAcronyms.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.19.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors[12]. Otherwise, answer “n” in order to abort the program.
[12] Normal CFE errors exclude a) utterance and sentence marker location errors, b)
orthographic and phonetic alignment errors, and c) phonetic symbol errors.
2.19.4 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files while running this script.
If pronounced acronyms are found, the following file will be created in the startup directory:
• PronouncedAcronyms.txt
The pronounced acronyms will simply be listed in this file. It will be created whether
working with one or more directories or a single TextGrid (-f switch).
2.20 GetTextGridDirectoriesRecursively.pl
2.20.1 Overview
Recursively looks for directories containing TextGrid files.
2.20.2 Command Line Options
1) GetTextGridDirectoriesRecursively.pl
Will recursively look for folders containing TextGrid files in the directories listed in
DirectoryList.lst.
2) GetTextGridDirectoriesRecursively.pl -d
Will recursively look for folders containing TextGrid files in the specified directory.
3) GetTextGridDirectoriesRecursively.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.20.3 Generated Output Files
If directories were found containing TextGrid files the following file will be created in the
startup directory:
• Directories.txt
The directories that contain TextGrid files will simply be listed in this file.
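The recursive search can be sketched as follows. This is a Python illustration, not the
script's own code. The .TextGrid extension test and the function name are assumptions.

```python
import os


def textgrid_directories(root):
    """Return every directory under root (root included) that holds a TextGrid file."""
    found = []
    for directory, _subdirs, files in os.walk(root):
        if any(name.endswith(".TextGrid") for name in files):
            found.append(directory)
    return sorted(found)
```

Writing the returned paths to a file, one per line, would give a list in the spirit of
Directories.txt.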
2.21 Lin2Dos.pl
2.21.1 Overview
Converts a text file from Linux to DOS/Win format. Note that the original text file will be
overwritten.
2.21.2 Command Line Options
• Lin2Dos.pl
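The conversion replaces Linux line endings (LF) with DOS/Win line endings (CR LF). The sketch
below is a Python illustration, not the script's own code, and the function name is
hypothetical. It first normalises any CR LF pairs that are already present so that they are
not converted twice. Whether Lin2Dos.pl treats such lines the same way is not stated in this
report.

```python
def lin_to_dos(data):
    """Convert Linux (LF) line endings in a bytes object to DOS/Win (CR LF)."""
    # Normalise existing CR LF pairs first so they do not become CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
```

To overwrite a file in place, as the script does, its contents would be read in binary mode,
passed through this function, and written back to the same path.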
2.22 MoveDuplicateAttribs.pl
2.22.1 Overview
Moves all duplicate attribute files from the attribute directory to a specified directory.
Note that the attribute directory must be specified in a list file or on the command line.
The list file will be allowed to contain only one entry. Also, duplicate attribute files will
be removed from the attribute directory during the process of moving them.
2.22.2 Command Line Options
1) MoveDuplicateAttribs.pl
Attribute directory will be extracted from DirectoryList.lst.
2) MoveDuplicateAttribs.pl -d
Attribute directory will be extracted from command line.
3) MoveDuplicateAttribs.pl -l
Attribute directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.22.3 Questions
1) Please enter name of the destination folder.
Specify the destination folder’s name to which the identified duplicate attribute files
must be moved.
2) Remove files that exist in this directory (y/n)
If this destination directory contains files that should be removed answer “y”.
Otherwise, to specify a different directory answer “n”.
2.23 MovePhoneticTextGrids.pl
2.23.1 Overview
Phonetic transcription files generated using Patana are moved from orthographic
subdirectories (ort) to phonetic subdirectories (phon).
2.23.2 Command Line Options
1) MovePhoneticTextGrids.pl
Generated phonetic transcriptions that can be found in the directories (each
containing subdirectories ort and phon) which are specified in the list file
DirectoryList.lst are moved from orthographic subdirectories (ort) to phonetic
subdirectories (phon).
2) MovePhoneticTextGrids.pl -d
Generated phonetic transcriptions that can be found in the specified directory
(containing subdirectories ort and phon) are moved from orthographic subdirectory
(ort) to phonetic subdirectory (phon).
3) MovePhoneticTextGrids.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.24 NumberOfMalesAndFemales.pl
2.24.1 Overview
Determines how many male and female callers there are for a batch of data. This
information will be extracted from the attribute filenames lying under the attrib directories.
2.24.2 Command Line Options
1) NumberOfMalesAndFemales.pl
Attribute directory names will be extracted from DirectoryList.lst.
2) NumberOfMalesAndFemales.pl -d
Attribute directory names will be extracted from command line.
3) NumberOfMalesAndFemales.pl -l
Attribute directory names will be extracted from the newly specified list file instead
of DirectoryList.lst.
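Counting callers from filenames can be sketched as follows. This Python sketch is only an illustration: the report does not state the attribute filename layout, so the pattern used here (two letters, three digits, one letter, then M or F, modelled on TextGrid names such as EE002LM002 in Section 2.31.5) is an assumption, as is the idea that each attribute file corresponds to one caller.

```python
import re
from collections import Counter

# Assumed layout: two letters, three digits, one letter, then the gender letter.
GENDER_PATTERN = re.compile(r"^[A-Za-z]{2}\d{3}[A-Za-z]([MFmf])")


def count_genders(filenames, pattern=GENDER_PATTERN):
    """Count filenames per gender letter; names that do not match are 'unknown'."""
    counts = Counter()
    for name in filenames:
        match = pattern.match(name)
        if match:
            counts[match.group(1).upper()] += 1
        else:
            counts["unknown"] += 1
    return counts
```

If the real naming specification differs, only GENDER_PATTERN would need to change.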
2.25 ProcessRawTranscriptionData.pl
2.25.1 Overview
Processes raw transcription data lying in call folders - renames alaw, attribute and
TextGrid files and then moves them to the appropriate directories. The call folders will be
replaced by the following directories: alaw/attrib/lab/merge/ort/phon. The only directories of
these that will contain any files after the process has been completed will be alaw, attrib
and ort.
2.25.2 Command Line Options
1) ProcessRawTranscriptionData.pl
Processes the raw data that can be found in the directories (the base directories to
call folders) which are specified in the list file DirectoryList.lst.
2) ProcessRawTranscriptionData.pl -d
Will only process the specified directory's (the base directory to call folders) raw
data.
3) ProcessRawTranscriptionData.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.25.3 Questions
1) What value must M have (A/E/I/B/C/X/S/Z allowed).
This value of M will be used for error checking purposes. As can be seen, only A, E,
I, B, C, X, S, and Z will be allowed.
2) What value must L have (A/E/X/S/Z allowed).
This value of L will be used for error checking purposes. As can be seen, only A, E,
X, S, and Z will be allowed.
2.25.4 Generated Output Files
If any of the attribute files were found to contain errors, the following file will be created in
the startup directory before aborting the program:
• AttribErrors.txt
Will list all the errors that occurred in the attribute files as seen over all the attribute
directories it worked in.
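The two questions restrict M and L to fixed sets of letters. A small Python sketch of such a check is shown here; it is not taken from the Perl script, the names are hypothetical, and accepting lower-case input is an assumption.

```python
# Allowed values as listed in Questions 1 and 2.
ALLOWED_M = set("AEIBCXSZ")
ALLOWED_L = set("AEXSZ")


def validate_answer(answer, allowed):
    """Return the normalised single-letter answer, or raise ValueError."""
    value = answer.strip().upper()  # assumption: case and surrounding spaces are ignored
    if len(value) != 1 or value not in allowed:
        raise ValueError(f"{answer!r} is not one of {'/'.join(sorted(allowed))}")
    return value
```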
2.26 RemoveDuplicateAttrib.pl
2.26.1 Overview
Removes duplicate attribute files that have been identified in earlier sessions from the attribute
directory by looking at the call numbers contained within each attribute file. Note that the
attribute directory must be specified in a list file or on the command line. The list file will be
allowed to contain only one entry. The script will ask for the name of the directory holding
the duplicate attribute files.
2.26.2 Command Line Options
1) RemoveDuplicateAttribs.pl
Attribute directory will be extracted from DirectoryList.lst.
2) RemoveDuplicateAttribs.pl -d
Attribute directory will be extracted from command line.
3) RemoveDuplicateAttribs.pl -l
Attribute directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.26.3 Questions
1) Please enter name of folder holding duplicate attributes.
Specify the folder holding the duplicate attribute files that must be removed from
the attribute directory. Note that the deletion process looks at the call numbers within
the attribute files in order to figure out which files to remove. The attribute filenames
therefore do not play a part in the deletion process.
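Matching on call numbers rather than filenames can be sketched as follows. This Python sketch is an illustration only: the layout of an attribute file is not given in this report, so the extractor hypothetical_call_number and the "Call number: N" line it looks for are invented for the example.

```python
import re


def hypothetical_call_number(text):
    """Invented extractor: looks for a line such as 'Call number: 17'."""
    match = re.search(r"call\s*number\s*[:=]\s*(\d+)", text, re.IGNORECASE)
    return match.group(1) if match else None


def select_files_to_remove(attrib_files, duplicate_files, call_number_of):
    """Return the names in attrib_files whose call number occurs among the duplicates.

    Both arguments map a filename to the text of that attribute file.
    """
    duplicate_calls = {call_number_of(text) for text in duplicate_files.values()}
    duplicate_calls.discard(None)  # files without a call number never match
    return sorted(
        name
        for name, text in attrib_files.items()
        if call_number_of(text) in duplicate_calls
    )
```

Note that the filenames in the duplicate folder are never compared with those in the attribute directory; only the extracted call numbers are.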
2.27 RemoveInvalidIntervals.pl
2.27.1 Overview
Removes those interval markers that were wrongly put into the transcriptions at the
beginning and the end of the speech (therefore the first interval and last interval) in Praat.
TextGrids may contain either orthographic transcriptions, or orthographic and phonetic
transcriptions. Warning: Avoid running this script more than once on the data.
2.27.2 Command Line Options
1) RemoveInvalidIntervals.pl
Removes invalid intervals from the TextGrid files that can be found in the directories
which are specified in the list file DirectoryList.lst.
2) RemoveInvalidIntervals.pl -f
Removes invalid intervals from the specified TextGrid file.
3) RemoveInvalidIntervals.pl -d
Removes invalid intervals from the specified directory's TextGrid files.
4) RemoveInvalidIntervals.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.27.3 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files during the process of running this script.
If working in one or more directories the following files could be created in the startup
directory:
• DirectoriesWithUpdates.txt
Those directories in which invalid intervals were removed from TextGrid files will be
listed in this file.
• DirectoriesWithPossibleIntervalsToRemove.txt
Will list those directories that were found to contain TextGrid files with intervals that
should possibly be removed – these are intervals that the script was unsure about
and as a result did not remove.
Those directories containing TextGrid files with possible intervals to remove will each
contain the following file:
• FilesWithPossibleIntervalsToRemove.txt
Will hold the names of the TextGrid files in this directory that contain possible
intervals to remove.
2.28 RemoveTextGridFilesWithErrors.pl
2.28.1 Overview
Removes TextGrid files with errors from TextGrid directory. Note that the names of the
TextGrid files containing the errors will be obtained from the FilesWithErrors subdirectory
lying under the TextGrid folder. Use this script wisely!
2.28.2 Command Line Options
1) RemoveTextGridFilesWithErrors.pl
Will remove TextGrid files containing errors from the specified TextGrid directory.
2.28.3 Questions
1) WARNING: This script will removed TextGrid files containing errors.
Do you want to continue? (y/n)
To remove the TextGrids containing errors answer “y”. Otherwise, to abort the
program answer “n”.
2.29 Renamer.pl
2.29.1 Overview
This script fixes those filenames that are not according to spec with respect to case and
counter. Note that each base folder should contain alaw, attrib, lab, and either merge,
merge_Praat or merge_XSAMPA directories.
2.29.2 Command Line Options
1) Renamer.pl
Will fix those filenames which are located in the folders that can be found in the
directories which are specified in the list file DirectoryList.lst.
2) Renamer.pl -d
Will fix those filenames which are located in the folders that can be found in the
specified directory.
3) Renamer.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.29.3 Questions
1) In which directory does the TextGrid files occur:
1. merge
2. merge_Praat
3. merge_XSAMPA
Specify in which directories the TextGrid files occur. Note that the directory names
MUST correspond to the option that you choose.
2.30 ReplaceAssimilations.pl
2.30.1 Overview
Replaces assimilations (#1..#2..#3). It can replace an assimilation with the text
standing between #1..#2 or #2..#3, or with something that is specified by
the user. TextGrid files may contain only orthographic transcriptions, or
orthographic and phonetic transcriptions.
2.30.2 Command Line Options
1) ReplaceAssimilations.pl
Will replace assimilations in the TextGrid files that can be found in the directories
which are specified in the list file DirectoryList.lst.
2) ReplaceAssimilations.pl -f
Will replace assimilations in the specified TextGrid file.
3) ReplaceAssimilations.pl -d
Will replace assimilations in the specified directory's TextGrid files.
4) ReplaceAssimilations.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.30.3 Questions
1) Replace assimilations with contents between #1 and #2 (y/n)
Answer “y” if assimilations exist that must be replaced with contents between
#1..#2. To move on to Question 3 answer “n”.
2) Please enter name of the file holding these assimilations.
Enter name of file that will hold the list of assimilations which must be replaced with
the text between #1..#2.
Note that this question will only be asked if the answer to Question 1 is “y”.
3) Replace assimilations with contents between #2 and #3 (y/n)
Answer “y” if assimilations exist that must be replaced with contents between
#2..#3. To move on to Question 5 answer “n”.
4) Please enter name of the file holding these assimilations.
Enter name of file that will hold the list of assimilations which must be replaced with
the text between #2..#3.
Note that this question will only be asked if the answer to Question 3 is “y”.
5) Replace assimilations with own suggestions (y/n)
Answer “y” if assimilations exist that must be replaced with your own suggestions.
To move on answer “n”.
6) Please enter name of the file holding these assimilations.
Enter name of file that will hold the list of assimilations and their replacements. The
file should have the following header: “OLD NEW”. This header therefore
defines two columns. Under the “OLD” column will be the list of assimilations
(#1..#2..#3) that must be replaced, while under the “NEW” column will be written
their replacements. Note that the replacement strings should not contain any
spaces.
Note that this question will only be asked if the answer to Question 5 is “y”.
2.30.4 Generated Output Files
See Section 2.2.4.
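The three replacement modes described in Section 2.30.3 can be sketched in Python as follows. This is not the Perl source of ReplaceAssimilations.pl. The sketch assumes that all three markers of an assimilation are the literal character "#", so that an assimilation looks like "#first#second#"; the example words and the function name are invented for illustration.

```python
import re

# Assumption: marker #1, #2 and #3 are each a literal '#'.
ASSIMILATION = re.compile(r"#([^#]*)#([^#]*)#")


def replace_assimilations(text, keep_first, keep_second, own):
    """Replace listed assimilations and leave unlisted ones unchanged.

    keep_first:  assimilations to replace with the text between #1 and #2
    keep_second: assimilations to replace with the text between #2 and #3
    own:         mapping from an assimilation (OLD) to its replacement (NEW)
    """
    def choose(match):
        whole = match.group(0)
        if whole in keep_first:
            return match.group(1)
        if whole in keep_second:
            return match.group(2)
        return own.get(whole, whole)

    return ASSIMILATION.sub(choose, text)
```

The three arguments play the role of the three list files that the script asks for in Questions 2, 4 and 6.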
2.31 Rip.pl
2.31.1 Overview
This script determines which orthographic words in a given batch of TextGrid files do not
occur in the lexicon. It can work with TextGrid files containing only orthographic
transcriptions, or orthographic and phonetic transcriptions.
2.31.2 Command Line Options
1) Rip.pl
Will check the TextGrid files that can be found in the directories which are specified
in the list file DirectoryList.lst for words that are not in the lexicon.
2) Rip.pl -f
Will check the specified TextGrid file for words that are not in the lexicon.
3) Rip.pl -d
Will check the specified directory's TextGrid files for words that are not in the
lexicon.
4) Rip.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.31.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors13. Otherwise, answer “n” in order to abort the program.
2) Enter the name of the file containing the NOVAR lexicon.
Specify the name of the NOVAR lexicon14 against which the script compares the
orthographic words it encounters in order to see which occur in the lexicon and
which do not.
13
Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment
errors, and c) phonetic symbol errors.
14
A NOVAR lexicon contains only one phonetic sequence for each orthographic word.
3) Select which words should be compared to the lexicon
1. All the orthographic words.
2. Only those words standing between dollar markers.
3. Only those words that are not standing between dollar markers.
Here the user must specify which words should be compared to the lexicon entries.
4) Only check words which contain internal zeros (y/n)
Answer “y” if only those words with internal zeros should be compared to the
lexicon. Otherwise, answer “n” to compare ALL the words (including the words with
internal zeros) to the lexicon.
2.31.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.31.5 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files during the process of running this script.
If working with one or more directories, the following files could possibly be created in the
startup directory:
• TextGridFilesContainingWordsNotInLexicon.txt
This file will hold for each affected TextGrid, the words which do not occur in the
lexicon. The words are identified during a CIS (case insensitive search) and should
thus be included in the lexicon. An example of such a file is shown below (note: the
lexicon contained few entries):
J:\PhonCorrect\EE\merge:
EE002LM002.TextGrid - Words not in lexicon:
Michael
Titlestad
EE002LM003.TextGrid - Words not in lexicon:
A.
D.
E.
I.
L.
S.
T.
EE002LM004.TextGrid - Words not in lexicon:
male
EE002LM005.TextGrid - Words not in lexicon:
six
thirty
EE002LM006.TextGrid - Words not in lexicon:
September
four
nineteen
of
sixth
sixty
twenty
.
.
.
• TextGridFilesContainingWordsWithCaseProblems.txt
This file will hold for each affected TextGrid, the words which do not occur in the
lexicon during a CSS (case sensitive search), but do occur during a CIS. These
words could possibly contain case problems and as such should be handled
separately from those words which are identified during a CIS. An example of such
a file is shown below (the same lexicon and data were used as in the above
example):
J:\PhonCorrect\EE\merge:
EE002LM014.TextGrid - Words with case problems:
May
EE002LM017.TextGrid - Words with case problems:
No
EE004LF014.TextGrid - Words with case problems:
May
EE006LF006.TextGrid - Words with case problems:
May
.
.
.
This example shows that two words cause problems, namely “May” and “No”. Both
occurred in the lexicon as “may” and “no”. Now, since the word “may” occurred in
the lexicon, but the word “May” did not, the word “May” was flagged as possibly
being a word with a case problem. However, in this instance it was not, since “May”
was used as the name of a month and as such should be included in the lexicon
(remember that the lexicon is case sensitive!). On the other hand, the word “No”
was incorrectly used in the data, since it should have been written as “no”. As a
result, it must be changed in the TextGrid file to “no”.
• WordsNotInLexicon.txt
Will hold a unique list of all the words that were identified during a CIS to not be
included in the lexicon.
• WordsWithCaseProblems.txt
Will hold a unique list of all the words which do not occur in the lexicon during a
CSS, but do occur during a CIS.
If working with a single TextGrid file, the following files could possibly be created in the
startup directory:
• WordsNotInLexicon.txt
• WordsWithCaseProblems.txt
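The distinction between the two output lists follows from two lookups per word: a case sensitive search (CSS) and a case insensitive search (CIS). A minimal Python sketch of that classification is given here; it is an illustration rather than the Perl source of Rip.pl, and the function name is hypothetical.

```python
def classify_words(words, lexicon):
    """Split words into (not in lexicon, possible case problems).

    A word that fails the CIS goes into the first list. A word that fails
    the CSS but passes the CIS goes into the second list.
    """
    exact = set(lexicon)
    lowered = {entry.lower() for entry in exact}
    not_in_lexicon, case_problems = set(), set()
    for word in words:
        if word in exact:  # CSS hit: nothing to report
            continue
        if word.lower() in lowered:  # CSS miss, CIS hit
            case_problems.add(word)
        else:  # CIS miss
            not_in_lexicon.add(word)
    return sorted(not_in_lexicon), sorted(case_problems)
```

With a lexicon holding "may", "no" and "six", the words "May" and "No" land in the case-problem list and "Michael" in the not-in-lexicon list, which matches the example discussed in this section.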
2.32 SubstitutePhonCharacters.pl
2.32.1 Overview
Substitutes certain characters (or strings) in each TextGrid file's phonetic transcription with
new ones. The characters (or strings) and their replacements must be specified in
SubstitutePhonCharacters.lst. Also note that the TextGrid files must contain both
orthographic and phonetic transcriptions.
2.32.2 Command Line Options
1) SubstitutePhonCharacters.pl
Substitutes certain phon characters in the TextGrid files that can be found in the
directories which are specified in the list file DirectoryList.lst.
2) SubstitutePhonCharacters.pl -f
Substitutes certain phon characters in the specified TextGrid file.
3) SubstitutePhonCharacters.pl -d
Substitutes certain phon characters in the specified directory's TextGrid files.
4) SubstitutePhonCharacters.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.32.3 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• SubstitutePhonCharacters.lst
Will hold the list of characters or text strings to be replaced and their replacements.
The file should have the following header: “OLD NEW”. This header therefore
defines two columns. Under the “OLD” column will be the list of text strings that
must be replaced, while under the “NEW” column will be written their replacement
strings. Note that the replacement strings should not contain any spaces.
2.32.4 Generated Output Files
See Section 2.2.4.
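Reading a two-column OLD/NEW list and applying it can be sketched in Python as follows. This is an illustration, not the Perl source. It assumes that the columns are separated by white space (which also means that neither column may contain spaces) and that the substitutions are applied one after the other in file order; the phone strings in any example are invented.

```python
def parse_substitution_list(lines):
    """Parse lines of an OLD/NEW list file into a list of (old, new) pairs."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows or rows[0] != ["OLD", "NEW"]:
        raise ValueError("list file must start with the header 'OLD NEW'")
    pairs = []
    for row in rows[1:]:
        if len(row) != 2:
            raise ValueError(f"expected two columns, got: {row}")
        pairs.append((row[0], row[1]))
    return pairs


def substitute(phonetic, pairs):
    """Apply every (old, new) substitution in order to a phonetic transcription."""
    for old, new in pairs:
        phonetic = phonetic.replace(old, new)
    return phonetic
```

Because the pairs are applied in sequence, an earlier replacement can create text that a later pair matches, so the order of the entries in the list file may matter.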
2.33 Transcribe.pl
2.33.1 Overview
This script generates deterministic phonetic transcriptions for EE, IE, AE, BE, CE, SS,
XX and ZZ. Only orthographic transcriptions are allowed in the TextGrid files. The TextGrid
files will afterwards contain both orthographic and deterministic phonetic transcriptions.
2.33.2 Command Line Options
1) Transcribe.pl
Will transcribe the TextGrid files that can be found in the directories which are
specified in the list file DirectoryList.lst.
2) Transcribe.pl -f
Will only transcribe the specified TextGrid file.
3) Transcribe.pl -d
Will only transcribe the specified directory's TextGrid files.
4) Transcribe.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.33.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors15. Otherwise, answer “n” in order to abort the program.
2) WARNING: This script will convert the orthographic TextGrids to merged files.
Do you want to continue? (y/n)
Answer “y” to allow the script to convert the orthographic TextGrids to merged
TextGrid files containing both orthographic and phonetic transcriptions. Otherwise,
answer “n” in order to abort the program.
3) Please specify the language
1. EE/IE/AE/BE/CE
2. SS
3. XX
4. ZZ
Simply choose the language to be transcribed.
4) Enter name of file containing NOVAR lexicon.
Specify the name of the NOVAR lexicon16 that will be used during the transcription
process.
Note that this question will only be asked if the answer to Question 3 is “1” =>
EE/IE/AE/BE/CE.
5) Enter name of file containing grapheme-to-phon conversion rules.
Specify the name of the file holding the grapheme-to-phoneme conversion rules.
This script will be able to work with the rules of SS (Appendix B), XX (Appendix C),
and ZZ (Appendix D). Note that CheckLexicon.pl (Section 2.7) can be used to
check a grapheme-to-phoneme conversion rule lexicon for errors.
Note that this question will only be asked if the answer to Question 3 is “2”, “3” or
“4” => SS/XX/ZZ.
6) Enter name of file containing NOVAR lexicon.
Specify the name of the NOVAR lexicon that will be used during the transcription
process. Note that this lexicon must only contain the transcriptions of spelled letters
and words with internal zeros.
Note that this question will only be asked if the answer to Question 3 is “2”, “3” or
“4” => SS/XX/ZZ.
15
Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment
errors, and c) phonetic symbol errors.
16
A NOVAR lexicon contains only one phonetic sequence for each orthographic word.
7) Enter name of file containing BE NOVAR lexicon
Specify the name of the BE NOVAR lexicon that will be used to transcribe the
English words occurring between dollar markers.
Note that this question will only be asked if the answer to Question 3 is “2”, “3” or
“4” => SS/XX/ZZ.
2.33.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.33.5 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files during the process of running this script.
If working with one or more directories and words with internal zeros are encountered, the
following file will be created in the startup directory:
• TextGridFilesContainingWordsWithInternalZeros.txt
This file will hold for each affected TextGrid, the words with internal zeros. The file
has the same layout as the file shown in the example of Section 2.31.5.
Note that the generation of this file is not necessary anymore, since
WordsWithInternalZeros.pl can be used to obtain this information. This functionality
can therefore be removed from the script.
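For the lexicon-based case (option 1 of Question 3), the transcription step is essentially a lookup of each orthographic word in the NOVAR lexicon. The Python sketch here illustrates that idea only; it is not the Perl source of Transcribe.pl. The lexicon line layout (the word, white space, then the phonetic sequence) is an assumption, and any phone strings used with it are invented.

```python
def load_novar_lexicon(lines):
    """Parse 'word phonetic-sequence' lines, rejecting a word listed twice."""
    lexicon = {}
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        word, phones = fields[0], fields[1].strip()
        if word in lexicon:
            # A NOVAR lexicon holds only one phonetic sequence per word.
            raise ValueError(f"{word!r} occurs more than once")
        lexicon[word] = phones
    return lexicon


def transcribe(words, lexicon):
    """Return (phonetic sequences of the known words, words missing from the lexicon)."""
    transcribed, missing = [], []
    for word in words:
        if word in lexicon:
            transcribed.append(lexicon[word])
        else:
            missing.append(word)
    return transcribed, missing
```

The lookup is case sensitive, in line with the remark in Section 2.31.5 that the lexicon is case sensitive.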
2.34 WordsWithInternalZeros.pl
2.34.1 Overview
Extracts words with internal zeros from orthographic transcriptions. It can work with
TextGrid files containing only orthographic transcriptions, or orthographic and phonetic
transcriptions.
2.34.2 Command Line Options
1) WordsWithInternalZeros.pl
Will search the TextGrid files that can be found in the directories which are specified
in the list file DirectoryList.lst for words with internal zeros.
2) WordsWithInternalZeros.pl -f
Will only search the specified TextGrid file for words with internal zeros.
3) WordsWithInternalZeros.pl -d
Will only search the specified directory's TextGrid files for words with internal zeros.
4) WordsWithInternalZeros.pl -l
The newly specified list file will be used instead of DirectoryList.lst.
2.34.3 Questions
1) For this script to work properly, the data must be free of normal CFE errors.
Do you want to continue? (y/n)
Answer “y” to this question if the data is free of normal CFE (CheckForErrors.pl)
errors17. Otherwise, answer “n” in order to abort the program.
2.34.4 Generated Output Files
See Section 2.5.5 for the files that will be generated if errors are encountered in the
TextGrid files during the process of running this script.
If working with one or more directories, the following files will be created in the startup
directory if words with internal zeros are encountered:
• TextGridFilesContainingWordsWithInternalZeros.txt
This file will hold for each TextGrid, the words with internal zeros that were
encountered. The file has the same layout as the file shown in the example of
Section 2.31.5.
• WordsWithInternalZeros.txt
Holds the unique list of all the words with internal zeros that were encountered.
If working with a single TextGrid file and words with internal zeros are encountered, the
following file will be created in the startup directory:
• WordsWithInternalZeros.txt
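Building the unique word list can be sketched in Python as follows. This is an illustration, not the Perl source. The report does not define an internal zero in this section, so the sketch assumes it is the character "0" occurring anywhere except as the first or last character of a word; the example words are invented.

```python
def words_with_internal_zeros(transcription):
    """Return the sorted, unique words containing a '0' that is neither first nor last."""
    found = set()
    for word in transcription.split():
        if "0" in word[1:-1]:  # assumption: 'internal' excludes both end positions
            found.add(word)
    return sorted(found)
```

Collecting the results of this function over all TextGrid files of a batch would give the contents of a file such as WordsWithInternalZeros.txt.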
17
Normal CFE errors exclude a) utterance and sentence marker location errors, b) orthographic and phonetic alignment
errors, and c) phonetic symbol errors.
2.35 WorkThroughFWEErrorFiles.pl
2.35.1 Overview
This script allows the user to work through those TextGrid files lying under the
FilesWithErrors (FWE) directory that contain normal errors (determined by using
CheckForErrors.pl, and including utterance and sentence marker location errors,
orthographic and phonetic alignment errors, and phonetic symbol errors), SNBD errors, or
assimilation errors. Note that the TextGrid files must be accompanied by their alaw files in
order for this script to work properly. In addition to this, FilesWithErrors.txt must also occur
in the FWE directory. Also note that the FWE directory must be specified in a list file or on the
command line. Only one directory entry will be allowed in the list file.
2.35.2 Command Line Options
1) WorkThroughFWEErrorFiles.pl
Working directory will be extracted from DirectoryList.lst.
2) WorkThroughFWEErrorFiles.pl -d
Working directory will be extracted from command line.
3) WorkThroughFWEErrorFiles.pl -l
Working directory will be extracted from the newly specified list file instead of
DirectoryList.lst.
2.35.3 Questions
1) Choose one of the following
1. Start from scratch
2. Continue with previous session
Specify whether the session should start from the beginning (therefore start with the
first TextGrid in FWE) by answering “1” or to continue with a previous session by
answering “2”.
Note that this question will only be asked if FWEPreviousSession.txt exists in the
startup directory when the program is started. If not, the program will immediately
jump to Question 3.
2) Starting from scratch will overwrite the previous session's information!
Do you want to continue? (y/n)
Answer “y” if you want to overwrite the previous session’s information that is stored
in the startup directory under FWEPreviousSession.txt.
Note that this question will only be asked if the answer to Question 1 is “1”.
3) The TextGrids contain what type of errors
1. Normal CFE errors
2. SNBD (sil/nonsil/boundary/duration) errors
3. Assimilation errors
Specify the type of error that is being worked with.
Note that this question will only be asked if the answer to Question 1 is “1”.
2.35.4 Programs to Install
The following program must be installed under C:\Program Files\sendpraat:
• sendpraat.exe
This executable allows the script to control Praat.
2.35.5 Files Needed during Startup
The following file must exist in the FWE subdirectory in order for the script to work properly:
• FilesWithErrors.txt
This file holds the names of TextGrids containing errors as well as their associated
error messages.
2.35.6 Generated Output Files
The following files in the startup directory will constantly be updated during program
execution:
• FWEPreviousSession.txt
Current session’s information (previous session’s information during next startup)
will constantly be written to this file during program execution.
• FWEPreviousSession_bak.txt
FWEPreviousSession.txt will be saved to this file before it is updated with new
information.
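The backup-then-update behaviour of the two session files can be sketched in Python as follows. This is an illustration, not the Perl source; the contents of a session file are not described in this report, so the sketch treats them as an opaque string, and the function name is hypothetical.

```python
import os
import shutil


def save_session(info, session_path="FWEPreviousSession.txt",
                 backup_path="FWEPreviousSession_bak.txt"):
    """Copy the existing session file to the backup, then write the new information."""
    if os.path.exists(session_path):
        shutil.copyfile(session_path, backup_path)
    with open(session_path, "w", encoding="utf-8") as handle:
        handle.write(info)
```

Keeping the previous version in the _bak file means that one earlier state of the session can still be recovered if an update is interrupted.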
2.36 WorkThroughLex.pl
2.36.1 Overview
This script will enable the user to check the data for errors based on the information
contained in the VAR lexicon18. The user will thus be allowed to work through the lexicon
and then inspect the TextGrid files of those entries that look suspicious. Note that each
TextGrid containing an error can be edited using either Notepad or Praat. In addition to
this, its alaw can be listened to using either Awave Studio or Praat.
18
A VAR lexicon can contain various phonetic sequences for each orthographic word. The phonetic sequences of an
orthographic word are separated with commas.
2.36.2 Questions
1) Choose one of the following
1. Start from scratch
2. Continue with previous session
Specify whether the session should start from the beginning (therefore start with the
first entry in the lexicon) by answering “1” or to continue with a previous session by
answering “2”.
Note that this question will only be asked if PreviousSession.txt exists in the startup
directory when the program is started.
2) Starting from scratch will overwrite the previous session's information!
Do you want to continue? (y/n)
Answer “y” if you want to overwrite the previous session’s information that is stored
in the startup directory under PreviousSession.txt.
Note that this question will only be asked if the answer to Question 1 is “1”.
3) Please enter the name of the file holding the VAR lexicon.
The name of the VAR lexicon must be specified that was built up using BuildLex.pl
(Section 2.3).
Note that this question will only be asked if the answer to Question 1 is “1”.
4) Please enter the name of the file holding the location info.
The name of the file holding the location information must be specified that was built
up using BuildLex.pl.
Note that this question will only be asked if the answer to Question 1 is “1”.
2.36.3 Programs to Install
The following programs must be installed in order for the script to work properly:
• sendpraat.exe
This executable allows the script to control Praat and must be installed under
C:\Program Files\sendpraat.
• Awave.exe
This executable allows the user to listen to the alaw files and must be installed
under C:\Program Files\Awave Studio.
2.36.4 Files Needed in Startup Directory
The following file must lie in the startup directory in order for the script to work properly:
• PhoneSets.lst
Contains the XSAMPA and Praat phone sets. For more information about this list
file see Appendix A.
2.36.5 Generated Output Files
The following files will constantly be updated in the startup directory during program
execution:
• PreviousSession.txt
Current session’s information (previous session’s information during next startup)
will constantly be written to this file during program execution.
• PreviousSession_bak.txt
PreviousSession.txt will be saved in this file before it is updated with new
information.
3. USING THE SCRIPTS
3.1 Processing Raw Orthographic Transcription Files
The raw orthographic transcription files usually lie in call folders under a base directory.
Each call folder contains alaws, TextGrids and an attribute file. These files need to be
renamed and then copied to the following directories:
• alaw
• attrib
• ort
The following directories must also be created:
• lab
• merge
• phon
In addition to this, the orthographic TextGrids must be checked for transcription errors
before moving and renaming the transcription files.
The following scripts must be run in the order indicated below in order to accomplish these
tasks:
1. CheckNumberOfALawsAndTextGrids.pl (x1)
First of all, the call subdirectories must be checked to see whether they contain the
same number of alaw and orthographic TextGrid files.
2. GetTextGridDirectoriesRecursively.pl (x1)
Produce a list file with all the call folders in it. This list file will be used by the scripts
below.
3. CheckAndReplaceOrtNames.pl (x1)
Correct any orthographic transcription name errors.
4. Cleanup.pl (x1)
The transcriptions are forced to comply with certain specifications. This step is
crucial, since it lightens the workload when correcting the errors pointed out by
CheckForErrors.pl.
5. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are
removed)
Correct any transcription errors that exist in the TextGrid files using these two
scripts. Once the errors have been corrected, copy the TextGrid files back from the
FilesWithErrors directory to the TextGrid directory.
6. Cleanup.pl (x1)
Run Cleanup.pl one more time to ensure the transcriptions comply with the
specifications with regard to the use of white space.
7. RemoveInvalidIntervals.pl (x1!)
This script will remove any intervals that were incorrectly added at the beginning
and the end of the transcriptions in Praat. However, if the transcribers know what
they are doing, this step can be skipped.
8. ProcessRawTranscriptionData.pl (x1)
The transcription files will be renamed and moved to alaw, attrib, and ort directories.
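The check performed in step 1 can be illustrated with a short sketch. It is written in Python purely for illustration and is not the code of CheckNumberOfALawsAndTextGrids.pl. The function name count_mismatches and the file extensions .alaw and .TextGrid are assumptions, as is the layout of one subdirectory per call under the base directory.

```python
from pathlib import Path


def count_mismatches(base_dir, alaw_ext=".alaw", grid_ext=".TextGrid"):
    """Return {call_folder: (n_alaw, n_textgrid)} for folders whose counts differ.

    Assumption: every direct subdirectory of base_dir is a call folder, and the
    two file types can be told apart by extension alone.
    """
    mismatches = {}
    for folder in sorted(Path(base_dir).iterdir()):
        if not folder.is_dir():
            continue
        # Compare extensions case-insensitively, since raw names may vary in case.
        suffixes = [p.suffix.lower() for p in folder.iterdir() if p.is_file()]
        n_alaw = suffixes.count(alaw_ext.lower())
        n_grid = suffixes.count(grid_ext.lower())
        if n_alaw != n_grid:
            mismatches[folder.name] = (n_alaw, n_grid)
    return mismatches
```

A call folder that appears in the returned dictionary should be inspected by hand before any of the later steps are run on it.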
3.2 Generating Deterministic Phonetic Transcriptions
The following steps must be performed to generate TextGrid files containing
orthographic and deterministic phonetic transcriptions. Afterwards, the data must be given
to the transcribers so that they can perform phonetic correction on the data.
1. Rip.pl
Determine which words do not occur in the NOVAR lexicon. These words and their
phonetic representations must then be included in the lexicon. The number of times
to run this script will depend on the number of lexicons being used during the
transcription process.
2. CheckLexicon.pl (repeat until all errors are removed)
Make sure the NOVAR lexicon(s) contain(s) no errors. If a transcription rule lexicon
is also used, check it as well.
3. Transcribe.pl (x1)
Generate deterministic phonetic transcriptions using lexicon(s) and possibly
transcription rules.
4. CheckForErrors.pl (x1)
Run this script in order to make sure that there are no transcription errors.
Remember to check for orthographic and phonetic alignment errors, and phonetic
symbol errors. If errors exist, follow steps 4 to 6 of Section 3.1 to correct them.
After the deterministic phonetic transcriptions have been generated, move the TextGrid
files from the ort directory to the merge directory. Once this has been done, remove the ort
and phon folders. These directories can be removed, since they were only used when
Patana was still employed to produce the deterministic phonetic transcriptions.
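How transcription rules of the form grapheme=[phones] (see the appendices) could be applied to a word is sketched below. This is a minimal Python illustration of greedy longest-match rewriting, not the algorithm of Transcribe.pl, which is not reproduced in this section. The function names parse_rules and transcribe are hypothetical, and the sketch does not interpret the context symbols used in some rules (C, V and the underscore); such rules are stored as literal strings and therefore never match ordinary lower-case words.

```python
def parse_rules(lines):
    """Parse 'grapheme=[phones]' lines into a dict (hypothetical helper)."""
    rules = {}
    for line in lines:
        line = line.strip()
        if "=[" not in line or not line.endswith("]"):
            continue  # skip blank or malformed lines
        grapheme, phones = line.split("=[", 1)
        rules[grapheme] = phones[:-1]  # drop the closing bracket
    return rules


def transcribe(word, rules):
    """Rewrite word left to right, always taking the longest matching grapheme."""
    max_len = max(map(len, rules), default=0)
    phones = []
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phones.append(rules[chunk])
                i += size
                break
        else:
            # No rule covers the text at position i.
            raise ValueError(f"no rule matches {word[i:]!r}")
    return " ".join(phones)


if __name__ == "__main__":
    # Rules copied from the Zulu appendix; the word is only an example input.
    rules = parse_rules(["th=[t_h]", "t=[t_>]", "u=[u]", "l=[l]", "a=[a]"])
    print(transcribe("thula", rules))  # t_h u l a
```

Trying the longest grapheme first is what makes "th" win over "t" in the example; a word containing a grapheme without a rule raises an error rather than being silently skipped.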
3.3 Processing Phonetically Corrected Data
After the transcribers have phonetically corrected the data, it must be processed and
checked for errors. The following must be done:
1. Cleanup.pl (x1)
The transcriptions are forced to comply with certain transcription specifications. This
step is crucial, since it lightens the workload when correcting the errors pointed out
by CheckForErrors.pl.
2. BottomToTop.pl (x1)
Those orthographic intervals falling outside of utterance and sentence markers will
be updated to correspond with phonetic intervals. However, if the transcribers
updated both the orthographic and phonetic intervals, this step can be skipped.
3. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are
removed)
Correct any transcription errors that exist in the TextGrid files using these two
scripts. Remember to check for orthographic and phonetic alignment errors, and
phonetic symbol errors. Once the errors have been corrected, copy the TextGrid
files back from the FilesWithErrors directory to the TextGrid directory.
4. Cleanup.pl (x1)
Run Cleanup.pl one final time to ensure the transcriptions comply with the
specifications with regard to the use of white space.
5. Converter.pl (x1)
If the phonetic transcriptions are in Praat format, convert them to XSAMPA.
6. GenerateLabFiles.pl (x1)
Generate lab files using information contained in TextGrid files.
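The conversion in step 5 amounts to a table lookup against the symbol pairs listed in Appendix A. The Python sketch below illustrates the idea only and is not the code of Converter.pl. The function names load_phoneset and convert are hypothetical, and the sketch assumes that the phonetic symbols in a transcription are separated by white space, which this section does not state.

```python
def load_phoneset(lines):
    """Build a Praat -> XSAMPA lookup from two-column 'XSAMPA Praat' lines."""
    praat_to_xsampa = {}
    for line in lines:
        parts = line.split()
        # Skip the column header and any line that is not a symbol pair.
        if len(parts) != 2 or parts == ["XSAMPA", "Praat"]:
            continue
        xsampa, praat = parts
        praat_to_xsampa[praat] = xsampa
    return praat_to_xsampa


def convert(praat_symbols, table):
    """Convert white-space separated Praat symbols to XSAMPA.

    Assumption: symbols are separated by white space. Unknown symbols raise
    an error so that they can be corrected instead of being passed through.
    """
    out = []
    for sym in praat_symbols.split():
        if sym not in table:
            raise KeyError(f"unknown Praat symbol: {sym!r}")
        out.append(table[sym])
    return " ".join(out)


if __name__ == "__main__":
    # A few rows copied from Appendix A.
    table = load_phoneset(["XSAMPA Praat", "p_h p^h", "S \\sh", "N \\ng"])
    print(convert("p^h \\sh", table))  # p_h S
```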
3.4 Merging Phonetically Corrected Batches of Data
The batches that have been phonetically corrected must be merged at some point. The
following scripts are employed to accomplish this:
1. CopyPhoneCorrectedFiles.pl (x1)
Copies phonetically corrected data that is scattered over several base directories
to a single target base directory containing alaw/attrib/lab/merge directories.
2. RemoveDuplicateAttribs.pl (x1)
Remove from the attribute directory any duplicate attribute files left over from
possible earlier merging sessions. Duplicates are identified by the call numbers
contained within each attribute file.
3. MoveDuplicateAttribs.pl (x1)
Move all remaining duplicate attribute files from the attribute directory to a
specified directory. Work through these files and copy those that must be kept back
to the attribute directory.
4. DeleteRedundantALTFilesLookingAtAttribs.pl (x1)
Remove all alaw/lab/TextGrid files which do not have corresponding attribute files.
5. Renamer.pl (x1)
Rename alaw, attrib, lab and TextGrid files so that they conform to the specification
with respect to case and counter.
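The selection made in step 4 can be sketched as follows. The Python code is an illustration only, not the code of DeleteRedundantALTFilesLookingAtAttribs.pl, and it lists candidate files instead of deleting them. The rule that links a data file to its attribute file is an assumption: the hypothetical helper call_id_from_stem treats the part of the file name before the first underscore as the call identifier, which may not match the actual AST naming specification.

```python
from pathlib import Path


def call_id_from_stem(stem):
    """Hypothetical naming rule: the call identifier precedes the first underscore."""
    return stem.split("_", 1)[0].lower()


def redundant_files(base_dir, data_dirs=("alaw", "lab", "merge"),
                    attrib_dir="attrib", call_id=call_id_from_stem):
    """List data files whose call identifier has no attribute file.

    Nothing is deleted; the caller decides what to do with the returned paths.
    """
    base = Path(base_dir)
    known = {call_id(p.stem) for p in (base / attrib_dir).iterdir() if p.is_file()}
    orphans = []
    for name in data_dirs:
        folder = base / name
        if not folder.is_dir():
            continue
        for p in sorted(folder.iterdir()):
            if p.is_file() and call_id(p.stem) not in known:
                orphans.append(p)
    return orphans
```

Returning a list first makes it possible to review the candidates before anything is removed, which is the safer order of operations when attribute files may themselves have been lost.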
3.5 Correcting More Errors
Once all the batches of a specific language have been merged,
sil/nonsil/boundary/duration (SNBD), assimilation and lexicon errors must be corrected.
3.5.1 Sil/Nonsil/Boundary/Duration Errors
First, the data must be sent to the engineering team. They will run programs on the data in
order to determine sil/nonsil/boundary (SNB) errors. The following scripts must then be run
on the data to correct these errors:
1. CopySNBDErrorFilesToFWE.pl (x1)
Copy those TextGrid files – and their alaws – with SNB errors to the FilesWithErrors
subdirectory under the TextGrid directory.
2. WorkThroughFWEErrorFiles.pl
Work through those TextGrid files lying under the FilesWithErrors directory
containing SNB errors. Once the errors have been corrected, copy the TextGrid
files back from the FilesWithErrors directory to the TextGrid directory.
3. Cleanup.pl (x1)
The transcriptions are forced to comply with certain specifications. This step is
crucial, since it lightens the workload when correcting the errors pointed out by
CheckForErrors.pl.
4. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are
removed)
Correct any transcription errors that exist in the TextGrid files using these two
scripts. Once the errors have been corrected, copy the TextGrid files back from the
FilesWithErrors directory to the TextGrid directory.
5. Cleanup.pl (x1)
Run Cleanup.pl one final time to ensure the transcriptions comply with the
specifications with regard to the use of white space.
6. GenerateLabFiles.pl (x1)
Regenerate lab files.
At this point the data is sent back to the engineers in order for them to determine where
duration errors occur. To work through these errors, follow steps 1 to 6 above.
3.5.2 Assimilation Errors
After the SNBD errors have been corrected, the data is again sent to the engineering
team. This time round, a list of possible assimilation errors is produced. Run the following
scripts on the data in order to correct these errors:
1. CopyAssimErrorFilesToFWE.pl (x1)
TextGrids (filenames extracted from the list file produced by the engineers) – and
their alaws – with possible assimilation errors are copied to the FilesWithErrors
(FWE) subdirectory under the TextGrid directory. The list file produced by the
engineering team must also be copied to the FWE subdirectory and renamed to
FilesWithErrors.txt.
2. WorkThroughFWEErrorFiles.pl (x1)
Work through those TextGrid files lying under the FWE directory containing
assimilation errors. Once the errors have been corrected, copy the TextGrid files
back from the FilesWithErrors directory to the TextGrid directory.
3. Cleanup.pl (x1)
The transcriptions are forced to comply with certain specifications. This step is
crucial, since it lightens the workload when correcting the errors pointed out by
CheckForErrors.pl.
4. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are
removed)
Correct any transcription errors that exist in the TextGrid files using these two
scripts. Once the errors have been corrected, copy the TextGrid files back from the
FilesWithErrors directory to the TextGrid directory.
5. Cleanup.pl (x1)
Run Cleanup.pl one final time to ensure the transcriptions comply with the
specifications with regard to the use of white space.
6. GenerateLabFiles.pl (x1)
Regenerate lab files.
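The copy operation of step 1 can be sketched as follows. The Python code illustrates the procedure described above and is not the code of CopyAssimErrorFilesToFWE.pl. It assumes that the list file contains one file name per line and that files use the extensions .TextGrid and .alaw; neither the list format nor the extensions are specified in this section, and the function name copy_listed_files is hypothetical.

```python
import shutil
from pathlib import Path


def copy_listed_files(list_file, textgrid_dir, alaw_dir, fwe_name="FilesWithErrors"):
    """Copy listed TextGrids and their alaws to the FWE subdirectory.

    Assumption: list_file holds one file name per line. Returns two lists of
    file stems: those copied and those for which a file was missing.
    """
    textgrid_dir, alaw_dir = Path(textgrid_dir), Path(alaw_dir)
    fwe = textgrid_dir / fwe_name
    fwe.mkdir(exist_ok=True)
    copied, missing = [], []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()
        if not name:
            continue
        stem = Path(name).stem
        grid = textgrid_dir / f"{stem}.TextGrid"
        alaw = alaw_dir / f"{stem}.alaw"
        if grid.is_file() and alaw.is_file():
            shutil.copy2(grid, fwe / grid.name)
            shutil.copy2(alaw, fwe / alaw.name)
            copied.append(stem)
        else:
            missing.append(stem)
    # The list file is placed in the FWE subdirectory under the fixed name.
    shutil.copy2(list_file, fwe / "FilesWithErrors.txt")
    return copied, missing
```

Files reported as missing point to a mismatch between the engineers' list and the merged data, and are worth resolving before the errors are worked through.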
3.5.3 Lexicon Errors
A VAR lexicon must now be extracted from the data and worked through in order to
eliminate more errors from the data:
1. BuildLex.pl (x1)
Extract VAR lexicon from TextGrid files.
2. CheckLexicon.pl (repeat until all errors are removed)
Make sure the VAR lexicon contains no errors.
3. WorkThroughLex.pl
Work through VAR lexicon and correct any errors that are found in the TextGrid
files.
4. Cleanup.pl (x1)
The transcriptions are forced to comply with certain specifications. This step is
crucial, since it lightens the workload when correcting the errors pointed out by
CheckForErrors.pl.
5. CheckForErrors.pl & WorkThroughFWEErrorFiles.pl (repeat until all errors are
removed)
Correct any transcription errors that exist in the TextGrid files using these two
scripts. Once the errors have been corrected, copy the TextGrid files back from the
FilesWithErrors directory to the TextGrid directory.
6. Cleanup.pl (x1)
Run Cleanup.pl one final time to ensure the transcriptions comply with the
specifications with regard to the use of white space.
7. GenerateLabFiles.pl (x1)
Regenerate lab files.
3.6 Final Processing of Information
The data, VAR lexicon and counts lexicon can be delivered to the engineering team after
the following scripts have been run on the data:
1. ExtractTranscriptions.pl (x1)
Produce summary of orthographic and phonetic transcriptions occurring in the
TextGrid files.
2. BuildLex.pl (x1)
Extract VAR lexicon from TextGrid files. In addition to this, a counts lexicon is also
produced.
3. CheckLexicon.pl (x1)
Make sure the VAR lexicon contains no errors.
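The relationship between the two lexicons produced in step 2 can be sketched as follows. The Python code is an illustration only and not the code of BuildLex.pl. It assumes that a VAR lexicon lists every pronunciation variant observed for a word and that the counts lexicon records how often each word and variant pair occurs; the function name build_lexicons and the input format (pairs already extracted from the TextGrid files) are hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicons(pairs):
    """Build a variant lexicon and a counts lexicon from (word, pronunciation) pairs.

    Assumption: pairs have already been read from the orthographic and phonetic
    tiers of the TextGrid files; the reading itself is not shown here.
    """
    counts = Counter(pairs)          # occurrences of each (word, pronunciation)
    var = defaultdict(list)
    for word, pron in sorted(counts):
        var[word].append(pron)       # each distinct variant is listed once
    return dict(var), dict(counts)
```

A variant with a very low count next to a frequent one is a natural candidate for inspection when the lexicon is worked through.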
APPENDIX A – PHONESETS.LST
XSAMPA Praat
p p
p_h p^h
p_> p'
p-\S_> p\-v\sh'
p-\S_h p\-v\sh^h
b b
b_0 b\0v
b-\Z b\-v\zh
t t
t_h t^h
t_> t'
t-\K_> t\-v\l-'
t-\S t\-v\sh
t-\S_h t\-v\sh^h
t-\s_> t\-vs'
t-\s_h t\-vs^h
d d
d_0 d\0v
d-\Z d\-v\zh
d-\K\ d\-v\lz
d-\z d\-vz
c_> c'
c_h c^h
J\ \j-
k k
k_h k^h
k_> k'
k-\x_> k\-vx'
g \gs
? \?g
m m
m= m\|v
n n
J \nj
N \ng
r r
R\ \rc
4 \fh
f f
v v
T \te
D \dh
s s
z z
S \sh
Z \zh
x x
h h
h\ \h^
K \l-
K\ \lz
r\ \rt
j j
l l
|\ \|1
|\~ \|1\~^
|\_v~ \|1\~^\vv
|\_v \|1\vv
|\_h \|1^h
|\|\ \|2
|\|\~ \|2\~^
|\|\_v~ \|2\~^\vv
|\|\_v \|2\vv
|\|\_h \|2^h
!\ !
!\~ !\~^
!\_v~ !\~^\vv
!\_v !\vv
!\_h !^h
b_]
k=[k_>]
ll=[l=]
l=[l]
mm=[m=]
m=[m]
ntjh=[J t-\S_h]
ntj=[J t-\S]
ny=[J]
nq=[!\~]
nk=[N k_>]
ng=[N]
n=[n]
oCCCu=[o]
oCCCi=[o]
oCCu=[o]
oCCi=[o]
oCu=[o]
oCi=[o]
oo=[O:]
o=[O]
pjh=[p-\S_h]
pj=[p-\S_>]
ph=[p_h]
p=[p_>]
qh=[!\_h]
q=[!\]
rr=[r=]
r=[r]
sh=[S]
s=[s]
tsh=[t-\s_h]
tlh=[t-\K_h]
tjh=[t-\S_h]
ts=[t-\s_>]
tl=[t-\K_>]
tj=[t-\S]
th=[t_h]
t=[t_>]
uu=[u:]
u=[u]
w=[w]
y=[j]
APPENDIX C – XHOSA TRANSCRIPTION RULES
Cam_=[a m=]
aa=[a:]
a=[a]
bh=[b_0]
b=[b_<]
k=[k_>]
l=[l]
_mna_=[m= n a]
mhl=[m K]
mb=[m b]
mh=[m]
m=[m]
ntyh=[J c_h]
ntsh=[J t-\s_>]
ndl=[n d-\K\]
ndy=[J J\]
ngc=[|\_v~]
ngq=[!\_v~]
ngx=[|\|\_v~]
nkc=[N |\]
nkh=[N k_h]
nkq=[N !\]
nkw=[N k_> w]
nkx=[N |\|\]
nty=[J c_>]
nyh=[J]
nkV=[N k_>]
nc=[|\~]
ng=[N g]
nj=[J d-\Z]
nq=[!\~]
nx=[|\|\~]
ny=[J]
nz=[n d-\z]
n=[n]
oCCCCi=[o]
oCCCCu=[o]
oCCCi=[o]
oCCCu=[o]
oCCi=[o]
oCCu=[o]
oCi=[o]
oCu=[o]
oo=[O:]
o=[O]
ph=[p_h]
p=[p_>]
qh=[!\_h]
q=[!\]
rh=[x]
r=[r]
sh=[S]
s=[s]
ths=[t-\s_>]
tsh=[t-\S_h]
tyh=[c_h]
th=[t_h]
tl=[t-\K_>]
ts=[t-\s_>]
ty=[c_>]
t=[t_>]
_umbh=[u m= b]
_umb=[u m= b]
_umC=[u m=]
uu=[u:]
u=[u]
v=[v]
w=[w]
xh=[|\|\_h]
x=[|\|\]
y=[j]
z=[z]
APPENDIX D – ZULU TRANSCRIPTION RULES
aa=[a:]
a=[a]
bh=[b_0]
b=[b_<]
k=[k_>]
l=[l]
mhl=[m K]
mb=[m b]
m=[m]
ntsh=[J t S]
ndl=[n d-\K\]
ngc=[|\_v~]
ngq=[!\_v~]
ngx=[|\|\_v~]
nkc=[N |\]
nkq=[N !\]
nkw=[N k_> w]
nkx=[N |\|\]
nkV=[N k_>]
nc=[|\~]
ng=[N g]
nj=[n d-\Z]
nq=[!\~]
nx=[|\|\~]
ny=[J]
nz=[n d-\z]
n=[n]
oCCCCi=[o]
oCCCCu=[o]
oCCCi=[o]
oCCCu=[o]
oCCi=[o]
oCCu=[o]
oCi=[o]
oCu=[o]
oo=[O:]
o=[O]
ph=[p_h]
p=[p_>]
qh=[!\_h]
q=[!\]
r=[r]
sh=[S]
s=[s]
tsh=[t-\S_h]
th=[t_h]
ts=[t-\s_>]
t=[t_>]
uu=[u:]
u=[u]
v=[v]
w=[w]
xh=[|\|\_h]
x=[|\|\]
y=[j]
z=[z]