NLP 3


Tokenisation and the Regular Expression
TOKENISATION AND SENTENCE SEGMENTATION
             Tokenisation
• Breaking up the sequence of characters in a
  text by locating the word boundaries
  – Word segmentation
  – Sentence segmentation
  Why is text segmentation challenging?

• Language dependence
  – Logographic
  – Syllabic
  – Alphabetic
  – Also orthographic conventions
• Character-set dependence
  – 7-bit ASCII
  – 8-bit encoding
  – Two-byte encodings
  Why is text segmentation challenging?

• Application dependence
  – Tokenisation depends on later-processing stages
• Corpus dependence
  – Increasing availability of large corpora in multiple
    languages with range of data types
                Tokenisation
• Factors affecting difficulty of tokenising
  natural languages:
  – Space-delimited languages
  – Unsegmented languages
 Tokenisation in space-delimited languages

• Tokenisation ambiguity

• Clairson International Corp. said it expects to
  report a net loss for its second quarter ended
  March 26 and doesn’t expect to meet analysts’
  profit estimates of $3.9 to $4 million, or 76
  cents a share to 79 cents a share, for its year
  ending Sept. 24.
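
As a rough illustration of the ambiguity in this sentence, the sketch below contrasts plain whitespace splitting with a small regular-expression tokeniser. The sentence is shortened and typed with straight apostrophes, and the token pattern and its abbreviation list are assumptions made for this example, not a method given in the slides.

```python
import re

# Shortened version of the slide's sentence, typed with straight apostrophes.
text = ("Clairson International Corp. said it doesn't expect to meet "
        "analysts' profit estimates of $3.9 to $4 million.")

# Naive approach: split on whitespace only, so punctuation stays glued to words.
naive = text.split()

# Hypothetical token pattern (an assumption for illustration), tried left to right:
#   listed abbreviations | numbers with optional $ and decimals |
#   words with internal apostrophes | any other single symbol
TOKEN = re.compile(r"Corp\.|Sept\.|\$?\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]")
tokens = TOKEN.findall(text)

print(naive)
print(tokens)
```

The whitespace version keeps 'million.' as one token, while the pattern version separates the final full stop but keeps 'Corp.' and '$3.9' whole. Whether that is the right answer depends on the later processing stages.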
         Tokenising punctuation
•   Abbreviations
•   Quotation marks and apostrophes
•   Multipart words
•   Multiword expressions
         Sentence segmentation
•   Full stops
•   Question marks
•   Exclamation marks
•   Semi-colons, colons, dashes and commas
•   Here is a sentence. Here is another.
•   Here is a sentence; here is another.
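
A minimal sketch of a rule-based sentence splitter, assuming the simple rule 'split after a full stop, question mark or exclamation mark when whitespace and a capital letter follow'. The function name and the 'Mr. Smith' test sentence are invented for illustration.

```python
import re

def split_sentences(text):
    """Split after . ? or ! when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)

print(split_sentences("Here is a sentence. Here is another."))
print(split_sentences("Here is a sentence; here is another."))
# An abbreviation followed by a capitalised word fools the rule:
print(split_sentences("Mr. Smith arrived."))
```

The semi-colon example stays as one sentence under this rule, and the abbreviation example is split wrongly, which is why abbreviations have to be handled before or during segmentation.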
                 Conclusion
• Dismissed as ‘pre-processing’
• Early systems designed to process small texts
  in single language
• Explosion in availability of large unrestricted
  corpora leads to consideration of challenges
  posed by processing unrestricted texts.
• Errors at text segmentation stage directly
  affect all later processing stages
  Spam filters are built on tokenisation

• Older filters: the author hand-codes the characteristics
  of spam
• Newer filters are statistical, automatically
  identify spam features based on message
  content
• Data generated by tokeniser passed to
  analysis engine for interpreting
                      Basic delimiters
• What constitutes a delimiter?
• What is a constituent character?
• First delimiter – the space
For a Confidential Phone Interview, Please
  Complete Form & Submit.

For A             Confidential   Phone   Interview,
Please   Complete Form           &       Submit.
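
Using the space as the only delimiter can be sketched with Python's built-in `str.split`, which splits on any run of whitespace:

```python
line = "For a Confidential Phone Interview, Please Complete Form & Submit."

# With no argument, split() treats any run of whitespace as one delimiter.
tokens = line.split()
print(tokens)
```

Punctuation stays attached ('Interview,' and 'Submit.'), and '&' becomes a token of its own.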
                          Redundancy
• Exclamation mark
   –   Free
   –   Free!
   –   Free!    Free!!!
   –   F!R!E!E!
• Other delimiters
   –   Brackets [ ]
   –   Braces { }
   –   Parentheses ( )
   –   Mathematical operators
   –   Special characters
   –   The at sign
   –   Underscores
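
The effect of treating these characters as delimiters can be sketched as follows. The delimiter set chosen here (whitespace, exclamation mark, brackets, braces, parentheses) is an assumption for the example; a real filter would tune it.

```python
import re

# Assumed delimiter set: whitespace, '!', and the three bracket pairs.
DELIMS = re.compile(r"[\s!\[\]{}()]+")

def tokenise(text):
    """Split on runs of delimiters and drop empty strings."""
    return [t for t in DELIMS.split(text) if t]

print(tokenise("Free! Free!!!"))
print(tokenise("F!R!E!E!"))
```

With '!' as a delimiter, 'Free' and 'Free!!!' collapse to the same token, but 'F!R!E!E!' falls apart into single letters.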
           Token reassembly
• Obfuscated text:
C/A/L/L/ N-O-W – I/T/S F_R_E_E
Concatenate adjacent single-character tokens
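
One possible reading of this reassembly step, as a sketch: within each space-delimited chunk, split on the obfuscating characters and rejoin the pieces only if every piece is a single character. The function name and the choice of '/', '-' and '_' as obfuscators are assumptions.

```python
import re

def reassemble(text):
    """Rejoin chunks such as F_R_E_E whose pieces are all single characters."""
    words = []
    for chunk in text.split():
        parts = [p for p in re.split(r"[/_\-]", chunk) if p]
        if len(parts) > 1 and all(len(p) == 1 for p in parts):
            words.append("".join(parts))   # obfuscated word: glue it back together
        else:
            words.append(chunk)            # ordinary token: leave untouched
    return words

print(reassemble("C/A/L/L/ N-O-W – I/T/S F_R_E_E"))
```

Ordinary hyphenated words such as 'no-obligation' are left alone, because their pieces are longer than one character.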
                     Header Optimisation
From: bazz@xum2.xumx.com
To: bazz@xum2.xumx.com
Reply-To: mort239o@xum2.xumx.com
Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS! kplxl
X-Keywords:

Save thousands by refinancing now. Apply from the privacy of your home and
receive a FREE no-obligation loan quote.
http://211.78.96.11/acct/morquote/

Rates are Down. YOU Win!
Self-Employed or Poor Credit is OK!

Get CASH out or money for Home Improvements, Debt Consolidation and more.
Interest rates are at the lowest point in years-right now! This is the perfect
time for you to get a FREE quote and find out how much you can save!
         Background to the RE
• Context:
• Earliest computers were number processors
• Computers represent linguistic symbols in
  non-linguistic ways, i.e. as numbers
  – ASCII codes for GO:
     • 01000111 (71)
     • 01001111 (79)
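
The two codes above can be checked with Python's built-in `ord` and `format`. The helper name is invented for the example.

```python
def ascii_bits(ch):
    """Return the code point of ch and its 8-bit binary form."""
    code = ord(ch)
    return code, format(code, "08b")

for ch in "GO":
    print(ch, *ascii_bits(ch))
```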
• Computational linguistics compiles statistics,
  derives indexes and concordances
Concordances
Machine translation and languages
•   US military and intelligence
•   Computers are good at symbol manipulation
•   FORTRAN, v. APL, Pascal, Prolog
•   1971 Winograd’s SHRDLU program
    – Written in LISP
• Procedural v. declarative languages
• Logic programming languages
    – Programmer specifies grammar
    – Computer generates example sentences allowed by
      grammar, decides whether given sentences are
      grammatical
           Regular Expression
• Standard notation for characterising text
  sequences
• Used in:
  – Web search
  – Info retrieval
  – Word-processing
  – Computation of frequencies
• Regular expressions are implemented via finite-state
  automata
                         REs
• What is the point of an RE?
  – To specify textual search strings
  – To specify the design of a particular kind of
    machine


• These are equivalent
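
To make the equivalence concrete, the sketch below hand-builds a small finite-state automaton and compares it with a regular expression for the same strings. The toy pattern ab+c and the state numbering are assumptions chosen for the example.

```python
import re

# Transition table for an automaton equivalent to the pattern ab+c.
# States: 0 = start, 1 = seen 'a', 2 = seen 'a' and at least one 'b', 3 = accept.
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "b"): 2, (2, "c"): 3}
ACCEPT = {3}

def fsa_accepts(s):
    """Run the automaton over s; reject as soon as no transition exists."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPT

def regex_accepts(s):
    return re.fullmatch(r"ab+c", s) is not None

for s in ["abc", "abbbc", "ac", "ab", "abcc", ""]:
    print(repr(s), fsa_accepts(s), regex_accepts(s))
```

Both functions give the same answer for every string tried, which is the sense in which the pattern and the machine describe the same set of strings.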
                      RE
• Developed by Kleene (1956)
• Formula in special language used for
  specifying simple classes of strings
• Used in Perl, but also UNIX, MS Word, etc
• RE search requires pattern and corpus
• Assumption: search engine returns the line of
  the document
          Matching literal text
TEXT

Hello, my name is Maeve. Please visit my website at
http://www.infm.ulst.ac.uk .
REGEX

Maeve
RESULT

Hello, my name is Maeve. Please visit my website at
http://www.infm.ulst.ac.uk .
         Matching literal text (2)
TEXT

Hello, my name is Maeve. Please visit my website at
http://www.infm.ulst.ac.uk/ .
REGEX

my
RESULT

Hello, my name is Maeve. Please visit my website at
http://www.infm.ulst.ac.uk/ .
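
The two literal searches can be reproduced with Python's `re` module. `re.search` returns the first match and `re.finditer` yields every match; the variable names are arbitrary.

```python
import re

text = ("Hello, my name is Maeve. Please visit my website at "
        "http://www.infm.ulst.ac.uk/ .")

first = re.search(r"Maeve", text)                        # first (and only) match
all_my = [m.span() for m in re.finditer(r"my", text)]    # every occurrence of 'my'

print(first.span(), all_my)
```

The pattern 'my' matches twice, and a lowercase 'maeve' would not match at all because matching is case sensitive by default.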
            Matching any characters
TEXT                    RESULT
   sales1.xls              sales1.xls
   orders3.xls             orders3.xls
   sales2.xls              sales2.xls
   sales3.xls              sales3.xls
   apac1.xls               apac1.xls
   europe2.xls             europe2.xls
   na1.xls                 na1.xls
   na2.xls                 na2.xls
   sa1.xls                 sa1.xls

REGEX
   sales.
            Matching any characters
TEXT                    RESULT
   sales.xls               sales.xls
   sales1.xls              sales1.xls
   orders3.xls             orders3.xls
   sales2.xls              sales2.xls
   sales3.xls              sales3.xls
   apac1.xls               apac1.xls
   europe2.xls             europe2.xls
   na1.xls                 na1.xls
   na2.xls                 na2.xls
   sa1.xls                 sa1.xls

REGEX
   sales.
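
The highlighting of the matched text is easier to see in code. This sketch runs the pattern over the second file list and records what each match covered; the helper name is an assumption.

```python
import re

FILES = ["sales.xls", "sales1.xls", "orders3.xls", "sales2.xls", "sales3.xls",
         "apac1.xls", "europe2.xls", "na1.xls", "na2.xls", "sa1.xls"]

def matched_text(pattern, names):
    """Map each name containing a match to the text that matched."""
    found = {}
    for name in names:
        m = re.search(pattern, name)
        if m:
            found[name] = m.group()
    return found

hits = matched_text(r"sales.", FILES)
print(hits)
```

Note that in sales.xls the dot in the pattern matches the literal full stop, because the dot matches any character, including itself.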
  Find all files for North (na) and South America (sa)
TEXT                      RESULT
   sales1.xls                sales1.xls
   orders3.xls               orders3.xls
   sales2.xls                sales2.xls
   sales3.xls                sales3.xls
   apac1.xls                 apac1.xls
   europe2.xls               europe2.xls
   na1.xls                   na1.xls
   na2.xls                   na2.xls
   sa1.xls                   sa1.xls

REGEX
   .a.
                 Try again…
TEXT                  RESULT
   sales1.xls            sales1.xls
   orders3.xls           orders3.xls
   sales2.xls            sales2.xls
   sales3.xls            sales3.xls
   apac1.xls             apac1.xls
   europe2.xls           europe2.xls
   na1.xls               na1.xls
   na2.xls               na2.xls
   sa1.xls               sa1.xls

REGEX
   .a..
                 Escape . with \.
TEXT                     RESULT
   sales1.xls               sales1.xls
   orders3.xls              orders3.xls
   sales2.xls               sales2.xls
   sales3.xls               sales3.xls
   apac1.xls                apac1.xls
   europe2.xls              europe2.xls
   na1.xls                  na1.xls
   na2.xls                  na2.xls
   sa1.xls                  sa1.xls

REGEX
   .a.\.xls
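
The three attempts at finding the na and sa files can be compared directly. This is a sketch using `re.search` on each file name; the helper name is arbitrary.

```python
import re

FILES = ["sales1.xls", "orders3.xls", "sales2.xls", "sales3.xls", "apac1.xls",
         "europe2.xls", "na1.xls", "na2.xls", "sa1.xls"]

def files_matching(pattern, names):
    """Return the names that contain at least one match for pattern."""
    return [n for n in names if re.search(pattern, n)]

for pattern in [r".a.", r".a..", r".a.\.xls"]:
    print(pattern, files_matching(pattern, FILES))
```

The first two patterns also pick up the sales files and apac1.xls, because an unescaped dot matches any character. Only the escaped version narrows the result to na1.xls, na2.xls and sa1.xls.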
        Matching sets of characters
TEXT                  RESULT
   sales1.xls            sales1.xls
   orders3.xls           orders3.xls
   sales2.xls            sales2.xls
   sales3.xls            sales3.xls
   apac1.xls             apac1.xls
   europe2.xls           europe2.xls
   na1.xls               na1.xls
   na2.xls               na2.xls
   sa1.xls               sa1.xls
   ca1.xls               ca1.xls

REGEX
   [ns]a.\.xls
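
A short check of why the character set is needed once ca1.xls is in the list. The variable names are arbitrary.

```python
import re

FILES = ["sales1.xls", "apac1.xls", "na1.xls", "na2.xls", "sa1.xls", "ca1.xls"]

any_first = [f for f in FILES if re.search(r".a.\.xls", f)]     # any character before 'a'
n_or_s = [f for f in FILES if re.search(r"[ns]a.\.xls", f)]     # only 'n' or 's' before 'a'

print(any_first)
print(n_or_s)
```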
         Avoiding case sensitivity
TEXT

The phrase ‘regular expression’ is often abbreviated as
RegEx or regex.
REGEX

[Rr]eg[Ee]x
RESULT

The phrase ‘regular expression’ is often abbreviated as
RegEx or regex.
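
The same search in Python, with the `re.IGNORECASE` flag shown as an alternative to character sets. The flag is not mentioned in the slides and is included only for comparison.

```python
import re

text = "The phrase 'regular expression' is often abbreviated as RegEx or regex."

with_sets = re.findall(r"[Rr]eg[Ee]x", text)
with_flag = re.findall(r"regex", text, flags=re.IGNORECASE)

print(with_sets, with_flag)
```

The two approaches agree on this text, but the flag is broader: it would also accept forms such as REGEX, which the character sets exclude.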
          Using character set ranges
TEXT                           RESULT
   sales1.xls                     sales1.xls
   orders3.xls                    orders3.xls
   sales2.xls                     sales2.xls
   sales3.xls                     sales3.xls
   apac1.xls                      apac1.xls
   europe2.xls                    europe2.xls
   sam.xls                        sam.xls
   na1.xls                        na1.xls
   na2.xls                        na2.xls
   sa1.xls                        sa1.xls
   ca1.xls                        ca1.xls

REGEX
   [ns]a[0123456789]\.xls
        Using character set ranges (2)
TEXT                    RESULT
   sales1.xls              sales1.xls
   orders3.xls             orders3.xls
   sales2.xls              sales2.xls
   sales3.xls              sales3.xls
   apac1.xls               apac1.xls
   europe2.xls             europe2.xls
   sam.xls                 sam.xls
   na1.xls                 na1.xls
   na2.xls                 na2.xls
   sa1.xls                 sa1.xls
   ca1.xls                 ca1.xls

REGEX
   [ns]a[0-9]\.xls
            ‘anything but’ matching
TEXT                     RESULT
   sales1.xls               sales1.xls
   orders3.xls              orders3.xls
   sales2.xls               sales2.xls
   sales3.xls               sales3.xls
   apac1.xls                apac1.xls
   europe2.xls              europe2.xls
   sam.xls                  sam.xls
   na1.xls                  na1.xls
   na2.xls                  na2.xls
   sa1.xls                  sa1.xls
   ca1.xls                  ca1.xls

REGEX
   [ns]a[^0-9]\.xls
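
The range and 'anything but' patterns from the last three slides can be run side by side. In this sketch the patterns are written without internal spaces, since a space inside a pattern would have to match a literal space.

```python
import re

FILES = ["sales1.xls", "orders3.xls", "sales2.xls", "sales3.xls", "apac1.xls",
         "europe2.xls", "sam.xls", "na1.xls", "na2.xls", "sa1.xls", "ca1.xls"]

def files_matching(pattern):
    return [f for f in FILES if re.search(pattern, f)]

listed = files_matching(r"[ns]a[0123456789]\.xls")   # every digit spelled out
ranged = files_matching(r"[ns]a[0-9]\.xls")          # same set written as a range
negated = files_matching(r"[ns]a[^0-9]\.xls")        # anything but a digit

print(listed, ranged, negated)
```

The spelled-out set and the range select the same files, and the negated set selects only sam.xls.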
               Basic RE patterns
• To search for salmon, type /salmon/
• To search for Maeve, type /Maeve/

RE             Example patterns
/salmon/       ‘interesting links to salmon and trout’
/a/            ‘Lara Croft has had amazing success.’
/John_says,/   ‘It’s cold out,’ John says, ‘
/weed/         ‘the garden is full of weeds’
/!/            ‘What a lovely day!’ said Anne
                     RE patterns
• Case sensitive
   /m/ distinct from /M/
   /maeve/ won’t match ‘Maeve’
   Use square brackets to specify disjunction
     /[mM]/
RE                     Match
/[mM]ouse/             Mouse or mouse, ‘Mouse’
/[abc]/                ‘a’, ‘b’, or ‘c’, ‘Hats off!’
/[1234567890]/         Any digit, ‘she works 9 to 5.’
                         The caret ^
      Use square brackets plus caret to specify what a single
       character cannot be.
      /[^a]/
         Matches any single character except a

RE             Match
[^A-Z]         Not an uppercase letter ‘Maeve’
[^Ss]          Neither S nor s ‘Maeve’
[^\.]          Not a full stop ‘Maeve’
[e^]           Either e or ^ ‘try here ^ now’
a^b            ‘a^b’ ‘try here a^b later’
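
The rows of this table can be tried in Python. One caveat, stated as a property of Python's `re` module rather than of the slide's notation: outside square brackets an unescaped caret is an anchor, so matching the literal text a^b needs the escaped form `a\^b`.

```python
import re

not_upper = re.search(r"[^A-Z]", "Maeve").group()       # first non-uppercase character
not_s = re.search(r"[^Ss]", "Maeve").group()            # first character that is not S or s
not_stop = re.search(r"[^\.]", "Maeve").group()         # first character that is not a full stop
e_or_caret = re.findall(r"[e^]", "try here ^ now")      # caret is literal when not first in brackets

unescaped = re.search(r"a^b", "try here a^b later")     # caret acts as an anchor here
escaped = re.search(r"a\^b", "try here a^b later")      # escaped caret is literal

print(not_upper, not_s, not_stop, e_or_caret, unescaped, escaped.group())
```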
               The question mark ?
 • How do we specify ‘boat’ and ‘boats’?
 • Square brackets allow us to specify ‘B’ or ‘b’, but not ‘s or
   nothing’
 • ? means the preceding character or nothing
RE              Match
/boats?/        boat or boats ‘boats’
/neighbou?r/    neighbor or neighbour ‘neighbor’


 Question mark means zero or one instances of previous
 character = optionality
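
Optionality in practice, as a short sketch. The test sentences are invented.

```python
import re

boats = re.findall(r"boats?", "one boat and two boats")
spellings = re.findall(r"neighbou?r", "my neighbor and your neighbour")

print(boats, spellings)
```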
                    Repetitions
• Sheep language…
   baa!
   baaa!
   baaaa!
   baaaaa!
• b followed by at least two a’s and an exclamation mark
• Solution: Kleene *
• Zero or more occurrences of immediately previous
  character or regular expression
                     Kleene *
• /a*/
  – Any string of zero or more a’s
  – This matches a or aaa, but also University of Ulster,
    which contains zero a’s
  – To match one or more a’s:
     /aa*/
• Further pattern:
  /[ab]*/
       Specifying multiple digits
• Using Kleene *
  /[0-9][0-9]*/
• Shorter way to specify ‘at least one character’
• Kleene +
  – One or more of previous character
  /[0-9]+/
• To specify the sheep language:
  /baaa*!/
  OR: /baa+!/
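
A sketch that checks the Kleene star and plus patterns above, using the lowercase spelling of the sheep language that the patterns themselves use. The helper name and test strings are assumptions.

```python
import re

def is_sheep(s, pattern=r"baa+!"):
    """True if the whole string is in the sheep language."""
    return re.fullmatch(pattern, s) is not None

# a* can match zero characters, so it 'matches' a string with no a at all.
empty = re.search(r"a*", "University of Ulster").group()

star_digits = re.findall(r"[0-9][0-9]*", "item 12 and item 345")
plus_digits = re.findall(r"[0-9]+", "item 12 and item 345")

print(is_sheep("baa!"), is_sheep("ba!"), repr(empty), star_digits, plus_digits)
```

The star and plus spellings of each pattern accept exactly the same strings; the plus form is simply shorter.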
                  Wildcard
• Wildcard matches any single character (except
  return)
  /r.n/
  Any single character between r and n:
  run, ran
• Use wildcard with Kleene * to specify any
  string of characters
• To find any line where salmon appears twice:
  /salmon.*salmon/
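
The wildcard patterns can be tried on a few invented lines. In Python the dot does not match a newline unless the `re.DOTALL` flag is given.

```python
import re

words = re.findall(r"r.n", "run ran rain")
across_newline = re.search(r"r.n", "r\nn")

lines = ["salmon and trout",
         "salmon fishing, then smoked salmon",
         "no fish here"]
twice = [line for line in lines if re.search(r"salmon.*salmon", line)]

print(words, across_newline, twice)
```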
                       Anchors
• Special characters to anchor REs to particular places in
  string
• The caret ^
   – Matches start of line /^The/
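   – Negation inside square brackets /[^a]/
   – A literal caret elsewhere in square brackets /[e^]/
• The dollar sign $
   – Matches end of line: /_$/ matches a space at the end of a
     line (_ stands for a space); /^The end\.$/ matches a line
     containing only ‘The end.’
• \b and \B
   – \b matches a word boundary /\bthe\b/
   – \B matches a non-word boundary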
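
A sketch of the anchors on invented test strings, with the space written as an actual space in the patterns:

```python
import re

starts = re.search(r"^The", "The end.") is not None          # 'The' at start of line
not_start = re.search(r"^The", "At The end") is not None     # 'The' elsewhere is rejected
whole_line = re.search(r"^The end\.$", "The end.") is not None
trailing = re.search(r" $", "trailing space ") is not None   # space at end of line

bounded = re.findall(r"\bthe\b", "the other one, then the end")
inside = re.search(r"\Bthe\B", "other") is not None          # 'the' buried inside a word

print(starts, not_start, whole_line, trailing, bounded, inside)
```

The word-boundary pattern finds the two free-standing occurrences of 'the' and skips 'other' and 'then'.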
                          Disjunction
•       Searching for pages about fly-fishing
•       Interested in trout and salmon
•       Use disjunction operator: |
                  /salmon|trout/
•       Interested in salmon and trout flies
•       How to specify fly and flies?
    –      WRONG: /fly|ies/
    –      Sequence takes precedence over |, so this matches fly or ies
•       Solution: use parentheses: ( )
                  /fl(y|ies)/
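
The precedence problem can be seen with `re.fullmatch`, which requires the whole string to match. The variable names are arbitrary.

```python
import re

wrong = re.compile(r"fly|ies")      # means: 'fly' OR 'ies'
right = re.compile(r"fl(y|ies)")    # means: 'fl' followed by 'y' OR 'ies'

print(bool(wrong.fullmatch("flies")), bool(wrong.fullmatch("ies")))
print(bool(right.fullmatch("fly")), bool(right.fullmatch("flies")))

fish = re.findall(r"salmon|trout", "trout and salmon flies")
print(fish)
```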
      Parenthesis plus Kleene *
• Pipe applies to whole sequence, Kleene *
  applies to single character
• Matching repeated instances of string like
  item 1, item 2, item 3…
  – WRONG:         /item_[0-9]+_*/
  – * applies only to the space (written _) that precedes it
  – Solution: parentheses
  /(item_[0-9]+_*)*/
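
The difference between the two patterns shows up in how much text each one consumes. This sketch assumes the underscore in the slide's patterns stands for a space and writes it as one.

```python
import re

s = "item 1 item 2 item 3"

# Without parentheses, the final * repeats only the space.
single = re.match(r"item [0-9]+ *", s).group()

# With parentheses, the final * repeats the whole 'item N' unit.
repeated = re.match(r"(item [0-9]+ *)*", s).group()

print(repr(single), repr(repeated))
```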
                  Example
• How to find instances of over
 /over/
 /[oO]ver/
 /\b[oO]ver\b/
 \b treats underscores and digits as word characters, so it misses
      cases such as over_25; without \b:
 /[^a-zA-Z][oO]ver[^a-zA-Z]/
 OR (also matching at the start of a line)
 /(^|[^a-zA-Z])[oO]ver[^a-zA-Z]/
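
The successive refinements can be compared on a few invented strings. The patterns are written without internal spaces. The last pattern, `WITH_END`, adds an end-of-line alternative; that extension is an assumption and not part of the slide.

```python
import re

PLAIN = r"over"
EITHER_CASE = r"[oO]ver"
BOUNDED = r"\b[oO]ver\b"
NO_LETTERS = r"[^a-zA-Z][oO]ver[^a-zA-Z]"
WITH_START = r"(^|[^a-zA-Z])[oO]ver[^a-zA-Z]"
WITH_END = r"(^|[^a-zA-Z])[oO]ver([^a-zA-Z]|$)"   # assumed extension, not on the slide

print(re.findall(PLAIN, "Over and moreover"))         # misses 'Over', finds 'over' in 'moreover'
print(re.findall(EITHER_CASE, "Over and moreover"))   # finds both, still inside 'moreover'
print(re.findall(BOUNDED, "Over and moreover"))       # only the free-standing word
print(re.search(BOUNDED, "over_25"))                  # None: '_' counts as a word character
print(re.search(NO_LETTERS, "Over the hill"))         # None: nothing precedes 'Over'
print(re.search(WITH_START, "Over the hill").group())
```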
             Further operators
RE      Match
\d      Any digit
\D      Any non-digit
\w      Any alphanumeric or underscore
\W      A non-alphanumeric
\s      Whitespace (space, tab)
\S      Non-whitespace


Aliases for common ranges which save typing
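
The aliases can be tried on short invented strings. One Python-specific caveat: for text patterns `\d`, `\w` and `\s` are Unicode-aware by default, so `\d` matches more than [0-9] unless the `re.ASCII` flag is given.

```python
import re

digits = re.findall(r"\d+", "76 cents to 79 cents")
non_digits = re.findall(r"\D+", "ab12cd")
words = re.findall(r"\w+", "over_25 now!")      # underscore and digits count as word characters
non_words = re.findall(r"\W", "a-b c")
pieces = re.split(r"\s+", "a b\tc")             # space and tab are both whitespace
chunks = re.findall(r"\S+", "a b\tc")

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit to \d, but not to [0-9].
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_only = re.fullmatch(r"\d", "\u0663", flags=re.ASCII) is not None

print(digits, non_digits, words, non_words, pieces, chunks, unicode_digit, ascii_only)
```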

								