                 Domain Adaptation with Latent Semantic Association
                          for Named Entity Recognition

    Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu and Zhong Su
                      IBM China Research Laboratory
                           Beijing, P. R. China
    {guohl, zhuhuiji, guozhili, zhangxx, wuxian, suzhong}@cn.ibm.com




                      Abstract

   Domain adaptation is an important problem in named entity recognition (NER). NER classifiers usually lose accuracy in the domain transfer due to the different data distribution between the source and the target domains. The major reason for the performance degradation is that each entity type often has many domain-specific term representations in the different domains. The existing approaches usually need a certain amount of labeled target domain data for tuning the original model. However, it is a labor-intensive and time-consuming task to build an annotated training data set for every target domain. We present a domain adaptation method with latent semantic association (LaSA). This method effectively overcomes the data distribution difference without leveraging any labeled target domain data. The LaSA model is constructed to capture latent semantic association among words from the unlabeled corpus. It groups words into a set of concepts according to the related context snippets. In the domain transfer, the original term spaces of both domains are first projected to a concept space using the LaSA model, and then the original NER model is tuned based on the semantic association features. Experimental results on English and Chinese corpora show that LaSA-based domain adaptation significantly enhances the performance of NER.

1   Introduction

Named entities (NE) are phrases that contain names of persons, organizations, locations, etc. NER is an important task in information extraction and natural language processing (NLP) applications. Supervised learning methods can effectively solve the NER problem by learning a model from manually labeled data (Borthwick, 1999; Sang and Meulder, 2003; Gao et al., 2005; Florian et al., 2003). However, empirical study shows that NE types have different distributions across domains (Guo et al., 2006). NER classifiers trained in the source domain usually lose accuracy in a new target domain when the data distribution differs between the two domains.

   Domain adaptation is a challenge for NER and other NLP applications. In the domain transfer, the reason for accuracy loss is that each NE type often has various specific term representations and context clues in the different domains. For example, {“economist”, “singer”, “dancer”, “athlete”, “player”, “philosopher”, ...} are used as context clues for NER. However, the distribution of these representations varies with domains. We expect to achieve better domain adaptation for NER by exploiting latent semantic association among words from different domains. Some approaches have been proposed to group words into “topics” to capture important relationships between words, such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990), probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). These models have been successfully employed in topic modeling, dimensionality reduction for text categorization (Blei et al., 2003), ad hoc IR (Wei and Croft, 2006), and so on.

   In this paper, we present a domain adaptation method with latent semantic association. We focus on capturing the hidden semantic association among words in the domain adaptation. We introduce the LaSA model to overcome the distribution difference between the source domain and the target domain. The LaSA model is constructed from the unlabeled corpus at first. It learns latent semantic association among words from their related context snippets. In the domain transfer, words in the corpus are associated with a low-dimension concept space using the LaSA model, and then the original NER model is tuned using these generated semantic association features. The intuition behind our method is that words in one concept set will have similar semantic features or latent semantic association, and share syntactic and semantic context in the corpus. They can be considered as behaving in the same way for discriminative learning in the source and target domains. The proposed method associates words from different domains on a semantic level rather than by lexical occurrence. It can better bridge the domain distribution gap without any labeled target domain samples. Experimental results on English and Chinese corpora show that LaSA-based adaptation significantly enhances NER performance across domains.

   The rest of this paper is organized as follows. Section 2 briefly describes the related work. Section 3 presents a domain adaptation method based on latent semantic association. Section 4 illustrates how to learn the LaSA model from the unlabeled corpus. Section 5 shows experimental results on large-scale English and Chinese corpora across domains, respectively. The conclusion is given in Section 6.

2   Related Work

Some domain adaptation techniques have been employed in NLP in recent years. Some of them focus on quantifying the generalizability of certain features across domains. Roark and Bacchiani (2003) use maximum a posteriori (MAP) estimation to combine training data from the source and target domains. Chelba and Acero (2004) use the parameters of the source domain maximum entropy classifier as the means of a Gaussian prior when training a new model on the target data. Daume III and Marcu (2006) use an empirical Bayes model to estimate a latent variable model grouping instances into domain-specific or common across both domains. Daume III (2007) further augments the feature space on the instances of both domains. Jiang and Zhai (2006) exploit the domain structure contained in the training examples to avoid over-fitting the training domains. Arnold et al. (2008) exploit feature hierarchy for transfer learning in NER. Instance weighting (Jiang and Zhai, 2007) and active learning (Chan and Ng, 2007) are also employed in domain adaptation. Most of these approaches need labeled target domain samples for model estimation in the domain transfer. Obviously, they require much effort for labeling the target domain samples.

   Some approaches exploit the common structure of related problems. Ando and Zhang (2005) learn predictive structures from multiple tasks and unlabeled data. Blitzer et al. (2006, 2007) employ structural correspondence learning (SCL) to infer a good feature representation from unlabeled source and target data sets in the domain transfer. We present the LaSA model to overcome the data gap across domains by capturing latent semantic association among words from unlabeled source and target data.

   In addition, Miller et al. (2004) and Freitag (2004) employ distributional and hierarchical clustering methods to improve the performance of NER within a single domain. Li and McCallum (2005) present a semi-supervised sequence modeling approach with syntactic topic models. In this paper, we focus on capturing hidden semantic association among words in the domain adaptation.

3   Domain Adaptation Based on Latent Semantic Association

The challenge in domain adaptation is how to capture latent semantic association from the source and target domain data. We present a LaSA-based domain adaptation method in this section.

   NER can be considered as a classification problem. Let X be a feature space to represent the observed word instances, and let Y be the set of class labels. Let p_s(x, y) and p_t(x, y) be the true underlying distributions for the source and the target domains, respectively. In order to minimize the effort required in the domain transfer, we often expect to use p_s(x, y) to approximate p_t(x, y).

   However, data distributions often vary with the domains. For example, in the economics-to-entertainment domain transfer, although many NE triggers (e.g. “company” and “Mr.”) are used in both domains, some are totally new, like “dancer” and “singer”. Moreover, many useful words (e.g. “economist”) in economics NER are useless in the entertainment domain. The above examples show that features can change behavior across domains. Some useful predictive features from one domain are not predictive or do not appear in another domain. Although some triggers (e.g. “singer”, “economist”) are completely distinct for each domain, they often appear in similar syntactic and semantic contexts. For example, triggers of person entities often appear as the subject of “visited”, “said”, etc., or are modified by “excellent”, “popular”, “famous”, etc. Such latent semantic association among words provides useful hints for overcoming the data distribution gap of both domains.

   Hence, we present a LaSA model θ_{s,t} to capture latent semantic association among words in the domain adaptation. θ_{s,t} is learned from the unlabeled source and target domain data. Each instance is characterized by its co-occurring context distribution in the learning. The semantic association feature in θ_{s,t} is a hidden random variable that is inferred from data. In the domain adaptation, we transfer the problem of semantic association mapping to a posterior inference task using the LaSA model. The latent semantic concept association set of a word instance x (denoted by SA(x)) is generated by θ_{s,t}. Instances in the same concept set are considered as behaving in the same way for discriminative learning in both domains. Even if word instances do not appear in a training corpus (or appear rarely) but are in similar contexts, they still might have relatively high probability in the same semantic concept set. Obviously, SA(x) can better bridge the gap between the two distributions p_s(y|x) and p_t(y|x). Hence, the LaSA model can enhance the estimate of the source domain distribution p_s(y|x; θ_{s,t}) to better approximate the target domain distribution p_t(y|x; θ_{s,t}).

4   Learning the LaSA Model from Virtual Context Documents

In the domain adaptation, the LaSA model is employed to find the latent semantic association structures of “words” in a text corpus. We illustrate how to build the LaSA model from words and their context snippets in this section. The LaSA model can actually be considered as a general probabilistic topic model. It can be learned on the unlabeled corpus using popular hidden topic models such as LDA or pLSI.

4.1   Virtual Context Document

The distribution of content words (e.g. nouns, adjectives) usually varies with domains. Hence, in the domain adaptation, we focus on capturing the latent semantic association among content words. In order to learn latent relationships among words from the unlabeled corpus, each content word is characterized by a virtual context document as follows.

   Given a content word xi, the virtual context document of xi (denoted by vd_{xi}) consists of all the context units around xi in the corpus. Let n be the total number of sentences which contain xi in the corpus. vd_{xi} is constructed as follows:

      vd_{xi} = {F(xi^{s1}), ..., F(xi^{sk}), ..., F(xi^{sn})}

where F(xi^{sk}) denotes the context feature set of xi in the sentence sk, 1 ≤ k ≤ n.

   Given the context window size {-t, t} (i.e. previous t words and next t words around xi in sk), F(xi^{sk}) usually consists of the following features.

   1. Anchor unit A_C^{xi}: the current focused word unit xi.

   2. Left adjacent unit A_L^{xi}: the nearest left adjacent unit x_{i-1} around xi, denoted by A_L(x_{i-1}).

   3. Right adjacent unit A_R^{xi}: the nearest right adjacent unit x_{i+1} around xi, denoted by A_R(x_{i+1}).

   4. Left context set C_L^{xi}: the other left adjacent units {x_{i-t}, ..., x_{i-j}, ..., x_{i-2}} (2 ≤ j ≤ t) around xi, denoted by {C_L(x_{i-t}), ..., C_L(x_{i-j}), ..., C_L(x_{i-2})}.

   5. Right context set C_R^{xi}: the other right adjacent units {x_{i+2}, ..., x_{i+j}, ..., x_{i+t}} (2 ≤ j ≤ t) around xi, denoted by {C_R(x_{i+2}), ..., C_R(x_{i+j}), ..., C_R(x_{i+t})}.

   For example, given xi = “singer” and sk = “This popular new singer attended the new year party”, let the context window size be {-3,3}. Then F(singer) = {singer, A_L(new), A_R(attend(ed)), C_L(this), C_L(popular), C_R(the), C_R(new)}.
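   As an illustrative sketch only (not the authors' code), the snippet below builds virtual context documents in the spirit of the construction above; the tokenization, the window size t = 3, and the helper names (context_features, build_virtual_context_documents) are assumptions made for illustration.

    from collections import defaultdict

    def context_features(tokens, i, t=3):
        # Context feature set F(x_i^{s_k}) for the word at position i,
        # following the anchor / adjacent-unit / context-set scheme above.
        feats = [tokens[i]]                                  # anchor unit
        if i - 1 >= 0:
            feats.append("AL=" + tokens[i - 1])              # left adjacent unit
        if i + 1 < len(tokens):
            feats.append("AR=" + tokens[i + 1])              # right adjacent unit
        for j in range(2, t + 1):                            # remaining context units
            if i - j >= 0:
                feats.append("CL=" + tokens[i - j])
            if i + j < len(tokens):
                feats.append("CR=" + tokens[i + j])
        return feats

    def build_virtual_context_documents(sentences, content_words, t=3):
        # Pool the context feature sets of each content word over all
        # sentences to form its virtual context document vd_x.
        vd = defaultdict(list)
        for tokens in sentences:
            for i, w in enumerate(tokens):
                if w in content_words:
                    vd[w].extend(context_features(tokens, i, t))
        return vd

    # The paper's example: F(singer) with window {-3,3}.
    sentences = [["this", "popular", "new", "singer", "attended",
                  "the", "new", "year", "party"]]
    vd = build_virtual_context_documents(sentences, {"singer"})
    print(vd["singer"])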

   vd_{xi} actually describes the semantic and syntactic feature distribution of xi in the domains. We construct the feature vector of xi with all the observed context features in vd_{xi}. Given vd_{xi} = {f1, ..., fj, ..., fm}, fj denotes the jth context feature around xi, 1 ≤ j ≤ m, and m denotes the total number of features in vd_{xi}. The value of fj is calculated by the Mutual Information (Church and Hanks, 1990) between xi and fj:

      Weight(fj, xi) = log2( P(fj, xi) / (P(fj) P(xi)) )        (1)

where P(fj, xi) is the joint probability of xi and fj co-occurring in the corpus, P(fj) is the probability of fj occurring in the corpus, and P(xi) is the probability of xi occurring in the corpus.
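   A minimal sketch of the weighting in Equation 1, computed from co-occurrence counts; the count arguments and the toy numbers are ours, introduced purely for illustration.

    import math

    def mi_weight(pair_count, f_count, x_count, total):
        # Weight(f_j, x_i) = log2( P(f_j, x_i) / (P(f_j) * P(x_i)) )   (Equation 1)
        p_joint = pair_count / total
        p_f = f_count / total
        p_x = x_count / total
        return math.log2(p_joint / (p_f * p_x))

    # Toy counts: the feature "AL=new" co-occurs 40 times with "singer"
    # in a corpus of 10,000 observations.
    print(mi_weight(pair_count=40, f_count=200, x_count=500, total=10_000))  # 2.0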
4.2   Learning the LaSA Model

Topic models are statistical models of text that posit a hidden space of topics in which the corpus is embedded (Blei et al., 2003). LDA (Blei et al., 2003) is a probabilistic model that can be used to model and discover underlying topic structures of documents. LDA assumes that there are K “topics”, multinomial distributions over words, which describe a collection. Each document exhibits multiple topics, and each word in each document is associated with one of them. LDA imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus. The topics derived by LDA seem to possess semantic coherence. Words with similar semantics are likely to occur in the same topic. Since the number of LDA model parameters depends only on the number of topic mixtures and the vocabulary size, LDA is less prone to over-fitting and is capable of estimating the probability of unobserved test documents. LDA has already been successfully applied to enhance document representations in text classification (Blei et al., 2003) and information retrieval (Wei and Croft, 2006).

   In the following, we illustrate how to construct the LDA-style LaSA model θ_{s,t} on the virtual context documents. Algorithm 1 describes the LaSA model training method in detail, where the function AddTo(data, Set) denotes that data is added to Set. Given a large-scale unlabeled data set Du which consists of the source and target domain data, the virtual context document for each candidate content word is extracted from Du at first; then the value of each feature in a virtual context document is calculated using its Mutual Information (see Equation 1 in Section 4.1) instead of the counts when running LDA. The LaSA model θ_{s,t} with a Dirichlet distribution is generated on the virtual context document set VD_{s,t} using the algorithm presented by Blei et al. (2003).

   Algorithm 1: LaSA Model Training
     Inputs:  unlabeled data set Du
     Outputs: LaSA model θ_{s,t}
     Initialization: virtual context document set VD_{s,t} = ∅;
                     candidate content word set X_{s,t} = ∅
     begin
       foreach content word xi ∈ Du do
         if Frequency(xi) ≥ the predefined threshold then
           AddTo(xi, X_{s,t});
       foreach xk ∈ X_{s,t} do
         foreach sentence Si ∈ Du do
           if xk ∈ Si then
             F(xk^{Si}) ← {xk, A_L^{xk}, A_R^{xk}, C_L^{xk}, C_R^{xk}};
             AddTo(F(xk^{Si}), vd_{xk});
         AddTo(vd_{xk}, VD_{s,t});
       Generate the LaSA model θ_{s,t} with a Dirichlet distribution on VD_{s,t};
     end
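   As a rough sketch of how Algorithm 1 could be realized with an off-the-shelf LDA implementation, the snippet below trains a topic model over the virtual context documents. The use of gensim's LdaModel is our own choice for illustration, not the authors' implementation, and feeding Mutual-Information weights in place of raw counts is approximated here by passing real-valued bag-of-words weights (clipped at zero), which standard LDA only tolerates as pseudo-counts.

    from gensim import corpora
    from gensim.models import LdaModel

    def train_lasa_model(virtual_docs, mi_weight, num_topics=50):
        # virtual_docs: dict mapping each content word to its virtual context
        # document (a list of context feature strings, as in Section 4.1).
        # mi_weight(feature, word): the Equation-1 weight of a feature for a word.
        words = list(virtual_docs)
        dictionary = corpora.Dictionary(virtual_docs[w] for w in words)
        corpus = []
        for w in words:
            bow = dictionary.doc2bow(virtual_docs[w])
            # Replace raw feature counts with Mutual-Information weights,
            # clipped at zero so the pseudo-counts stay non-negative.
            corpus.append([(fid, max(0.0, mi_weight(dictionary[fid], w)))
                           for fid, _ in bow])
        lasa = LdaModel(corpus=corpus, id2word=dictionary,
                        num_topics=num_topics, alpha="auto", passes=5)
        return lasa, dictionary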
   The LaSA model learns the posterior distribution to decompose words and their corresponding virtual context documents into topics. Table 1 lists the top 10 nouns from a random selection of 5 topics computed on the unlabeled economics and entertainment domain data. As shown, words in the same topic are representative nouns. They are actually grouped into broad concept sets. For example, sets 1, 3 and 4 correspond to nominal person, nominal organization and location, respectively. With a large-scale unlabeled corpus, we will have enough words assigned to each topic concept to better approximate the underlying semantic association distribution.

     Topic 1: customer, president, singer, manager, economist, policeman, reporter, director, consumer, dancer
     Topic 2: theater, showplace, courtyard, center, city, gymnasium, airport, square, park, building
     Topic 3: company, government, university, community, team, enterprise, bank, market, organization, agency
     Topic 4: Beijing, Hongkong, China, Japan, Singapore, New York, Vienna, America, Korea, international
     Topic 5: music, film, arts, concert, party, Ballet, dance, song, band, opera

   Table 1: Top 10 nouns from 5 randomly selected topics computed on the economics and entertainment domains

   In the LDA-style LaSA model, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all the virtual context documents. Hence, given a word xi in the corpus, we may perform posterior inference to determine the conditional distribution of the hidden topic feature variables associated with xi. The latent semantic association set of xi (denoted by SA(xi)) is generated using Algorithm 2. Here, Multinomial(θ_{s,t}(vd_{xi})) refers to sampling from the posterior distribution over topics given a virtual context document vd_{xi}. In the domain adaptation, we do semantic association inference on the source domain training data using the LaSA model at first; then the original source domain NER model is tuned on the source domain training data set by incorporating these generated semantic association features.
   Algorithm 2: Generate the Latent Semantic Association Set of Word xi Using a K-topic LaSA Model
     Inputs:  LaSA model θ_{s,t} with multinomial distribution;
              Dirichlet distribution Dirichlet(α) with parameter α;
              content word xi
     Outputs: SA(xi), the latent semantic association set of xi
     begin
       Extract vd_{xi} from the corpus;
       Draw topic weights θ_{s,t}(vd_{xi}) from Dirichlet(α);
       foreach fj in vd_{xi} do
         draw a topic zj ∈ {1, ..., K} from Multinomial(θ_{s,t}(vd_{xi}));
         AddTo(zj, Topics(vd_{xi}));
       Rank all the topics in Topics(vd_{xi});
       SA(xi) ← the top-n topics in Topics(vd_{xi});
     end
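   Continuing the gensim-based sketch from Section 4.2, Algorithm 2's per-feature sampling and ranking can be approximated by reading off the highest-weighted topics of a word's virtual context document; get_document_topics and the top-n cutoff are illustrative assumptions, not the paper's exact sampling procedure.

    def semantic_association_set(lasa, dictionary, virtual_doc, top_n=3):
        # SA(x): the top-n topics of the virtual context document of word x
        # under the trained LaSA model (cf. Algorithm 2), ranked here by
        # posterior topic weight rather than by per-feature sampling.
        bow = dictionary.doc2bow(virtual_doc)
        posterior = lasa.get_document_topics(bow, minimum_probability=0.0)
        ranked = sorted(posterior, key=lambda pair: pair[1], reverse=True)
        return [topic_id for topic_id, _ in ranked[:top_n]]

    # Usage with the model sketched above:
    # sa_singer = semantic_association_set(lasa, dictionary, vd["singer"])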
   The LaSA model better models the latent semantic association distribution in the source and the target domains. By grouping words into concepts, we effectively overcome the data distribution difference of both domains. Thus, we may reduce the number of parameters required to model the target domain data, and improve the quality of the estimated parameters in the domain transfer. The LaSA model extends the traditional bag-of-words topic models to a context-dependent concept association model. It also has potential use for concept grouping.

5   Experiments

We evaluate the LaSA-based domain adaptation method on both English and Chinese corpora in this section. In the experiments, we focus on recognizing person (PER), location (LOC) and organization (ORG) in four given domains: economics (Eco), entertainment (Ent), politics (Pol) and sports (Spo).

5.1   Experimental Setting

In the NER domain adaptation, nouns and adjectives make a significant impact on the performance. Thus, we focus on capturing latent semantic association for high-frequency nouns and adjectives (i.e. occurrence count ≥ 50) in the unlabeled corpus. LaSA models for nouns and adjectives are learned from the unlabeled corpus using Algorithm 1 (see Section 4.2), respectively. Our empirical study shows that better adaptation is obtained with a 50-topic LaSA model. Therefore, we set the number of topics N to 50, and define the context view window size as {-3,3} (i.e. previous 3 words and next 3 words) in the LaSA model learning. LaSA features for all other words (e.g. the token unit “the”) are assigned a default topic value N+1.

   All the basic NER models are trained on the domain-specific training data using the RRM classifier (Guo et al., 2005). RRM is a generalized Winnow learning algorithm (Zhang et al., 2002). We set the context view window size as {-2,2} in NER. Given a word instance x, we employ local linguistic features (e.g. word unit, part of speech) of x and its context units (i.e. previous 2 words and next 2 words) in NER. All Chinese texts in the experiments are automatically segmented into words using HMM.

   In LaSA-based domain adaptation, the semantic association features of each unit in the observation window {-2,2} are generated by the LaSA model at first; then the basic source domain NER model is tuned on the original source domain training data set by incorporating the semantic association features. For example, given the sentence “This popular new singer attended the new year party”, Figure 1 illustrates the various features and views at the current word wi = “singer” in LaSA-based adaptation.

     Position   w_{i-2}       w_{i-1}    w_i          w_{i+1}      w_{i+2}
     Word       popular       new        singer       attend       the
     POS        adj           adj        noun         verb         article
     SA         SA(popular)   SA(new)    SA(singer)   SA(attend)   SA(the)
     ...
     Tag        t_{i-2}       t_{i-1}    t_i

   Figure 1: Feature window in LaSA-based adaptation

   In the viewing window at the word “singer” (see Figure 1), each word unit around “singer” is codified with a set of primitive features (e.g. POS, SA, Tag), together with its relative position to “singer”. Here, “SA” denotes the semantic association feature set which is generated by the LaSA model. “Tag” denotes the NE tags labeled in the data set.
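   A minimal sketch of how such an observation window could be codified as a flat feature list; the feature-string format and the lookup functions pos_of and sa_of are hypothetical names used only to illustrate the layout of Figure 1.

    def window_features(words, i, pos_of, sa_of, width=2):
        # Codify the {-2,2} observation window around position i: each unit
        # contributes its word, POS and SA features, tagged with the
        # relative offset to the current word.
        feats = []
        for offset in range(-width, width + 1):
            j = i + offset
            if 0 <= j < len(words):
                feats.append("W[%+d]=%s" % (offset, words[j]))
                feats.append("POS[%+d]=%s" % (offset, pos_of(words[j])))
                feats.append("SA[%+d]=%s" % (offset, sa_of(words[j])))  # LaSA topic ids
        return feats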
   Given the input vector constructed with the above features, the RRM method is then applied to train linear weight vectors, one for each possible class label. In the decoding stage, the class with the maximum confidence is selected for each token unit.

   In our evaluation, only NEs with correct boundaries and correct class labels are counted as correct recognitions. We use the standard precision (P), recall (R), and F-measure (F = 2PR/(P+R)) to measure the performance of the NER models.
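   For concreteness, the scores above can be computed from counts as below; the function and variable names are ours.

    def precision_recall_f(num_correct, num_predicted, num_gold):
        # An NE counts as correct only if both its boundaries and its class
        # label match the gold annotation.
        p = num_correct / num_predicted if num_predicted else 0.0
        r = num_correct / num_gold if num_gold else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f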
5.2   Data

We built large-scale English and Chinese annotated corpora. The English corpus is generated from Wikipedia, while the Chinese corpus is selected from Chinese newspapers. Moreover, the test data do not overlap with the training data or the unlabeled data.

5.2.1   Generating the English Annotated Corpus from Wikipedia

   Wikipedia provides a variety of data resources for NER and other NLP research (Richman and Schone, 2008). We generate all the annotated English corpora from Wikipedia. Given the limited annotation effort available, only PER NEs in the corpus are automatically tagged, using an English person gazetteer. We automatically extract an English person gazetteer from Wikipedia at first. Then we select the articles from Wikipedia and tag them using this gazetteer.

   In order to build the English person gazetteer from Wikipedia, we first manually selected several key phrases, including “births”, “deaths”, “surname”, “given names” and “human names”. For each article title of interest, we extracted the categories to which that entry was assigned. An entry is considered a person name if its related explicit category links contain any one of the key phrases, such as “Category: human names”. We extracted 25,219 person name candidates from 204,882 Wikipedia articles in total, and expanded this gazetteer by adding other available common person names. Finally, we obtained a large-scale gazetteer of 51,253 person names.
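   A rough sketch of this category-based filtering; the iteration over (title, categories) pairs and the exact string matching are assumptions made for illustration, not the authors' pipeline.

    KEY_PHRASES = ("births", "deaths", "surname", "given names", "human names")

    def is_person_entry(categories):
        # An article title is kept as a person name if any of its explicit
        # category links contains one of the manually selected key phrases.
        return any(kp in cat.lower() for cat in categories for kp in KEY_PHRASES)

    def build_person_gazetteer(articles):
        # articles: iterable of (title, category_names) pairs taken from a
        # Wikipedia dump; the input format is assumed for illustration.
        return {title for title, cats in articles if is_person_entry(cats)}

    # e.g. ("Alan Turing", ["Category:1912 births", "Category:1954 deaths"]) is kept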
   All the articles selected from Wikipedia are further tagged using the above large-scale gazetteer. Since a human-annotated set was not available, we held out more than 100,000 words of text from the automatically tagged corpus as a test set in each domain. Table 2 shows the data distribution of the training and test data sets.

     Domain   Training: Size   PERs     Test: Size   PERs
     Pol      0.45M            9,383    0.23M        6,067
     Eco      1.06M            21,023   0.34M        6,951
     Spo      0.47M            17,727   0.20M        6,075
     Ent      0.36M            12,821   0.15M        5,395

   Table 2: English training and test data sets

   We also randomly select 17M of unlabeled English data (see Table 3) from Wikipedia. These unlabeled data are used to build the English LaSA model.

     Data Size (M):   All 17.06   Pol 7.36   Eco 2.59   Spo 3.65   Ent 3.46

   Table 3: Domain distribution in the unlabeled English data set

5.2.2   Chinese Data

   We built a large-scale, high-quality Chinese NE annotated corpus. All the data are news articles from several Chinese newspapers in 2001 and 2002. All the NEs (i.e. PER, LOC and ORG) in the corpus are manually tagged. Cross-validation checking is employed to ensure the quality of the annotated corpus.

     Training data set:
     Domain   Size (M)   PER      ORG      LOC      Total
     Pol      0.90       11,388   6,618    14,350   32,356
     Eco      1.40       6,821    18,827   14,332   39,980
     Spo      0.60       11,647   8,105    7,468    27,220
     Ent      0.60       12,954   2,823    4,665    20,442
     Test data set:
     Domain   Size (M)   PER      ORG      LOC      Total
     Pol      0.20       2,470    1,528    2,540    6,538
     Eco      0.26       1,098    2,971    2,362    6,431
     Spo      0.10       1,802    1,323    1,246    4,371
     Ent      0.10       2,458    526      738      3,722

   Table 4: Chinese training and test data sets

   All the domain-specific training and test data are selected from this annotated corpus according to the domain categories (see Table 4). 8.46M of unlabeled Chinese data (see Table 5) are randomly selected from this corpus to build the Chinese LaSA model.

5.3   Experimental Results

All the experiments are conducted on the above large-scale English and Chinese corpora. The overall performance enhancement of NER by LaSA-based domain adaptation is evaluated at first. Since the distribution of each NE type is different across domains, we also analyze the performance enhancement on each entity type by LaSA-based adaptation.

     Data Size (M):   All 8.46   Pol 2.34   Eco 1.99   Spo 2.08   Ent 2.05

   Table 5: Domain distribution in the unlabeled Chinese data set

5.3.1   Performance Enhancement of NER by LaSA-based Domain Adaptation

   Tables 6 and 7 show the experimental results for all pairs of domain adaptation on the English and Chinese corpora, respectively. In the experiments, the basic source domain NER model Ms is learned from the specific domain training data set Ddom (see Tables 2 and 4 in Section 5.2), where dom ∈ {Eco, Ent, Pol, Spo}. F^in_dom denotes the top-line F-measure of Ms in the source (trained) domain dom. When Ms is directly applied in a new target domain, its F-measure in this basic transfer is considered as the baseline (denoted by FBase). FLaSA denotes the F-measure of Ms achieved in the target domain with LaSA-based domain adaptation. δ(F) = (FLaSA − FBase) / FBase denotes the relative F-measure enhancement by LaSA-based domain adaptation.
     Source → Target   FBase     FLaSA     δ(F)      δ(loss)   F^in_dom
     Eco→Ent           57.61%    59.22%    +2.79%    17.87%    F^in_Ent = 66.62%
     Pol→Ent           57.50%    59.83%    +4.05%    25.55%    F^in_Ent = 66.62%
     Spo→Ent           58.66%    62.46%    +6.48%    47.74%    F^in_Ent = 66.62%
     Ent→Eco           70.56%    72.46%    +2.69%    19.33%    F^in_Eco = 80.39%
     Pol→Eco           63.62%    68.10%    +7.04%    26.71%    F^in_Eco = 80.39%
     Spo→Eco           70.35%    72.85%    +3.55%    24.90%    F^in_Eco = 80.39%
     Eco→Pol           50.59%    52.70%    +4.17%    15.81%    F^in_Pol = 63.94%
     Ent→Pol           56.12%    59.82%    +6.59%    47.31%    F^in_Pol = 63.94%
     Spo→Pol           60.22%    62.60%    +3.95%    63.98%    F^in_Pol = 63.94%
     Eco→Spo           60.28%    61.21%    +1.54%     9.93%    F^in_Spo = 69.65%
     Ent→Spo           60.28%    62.68%    +3.98%    25.61%    F^in_Spo = 69.65%
     Pol→Spo           56.94%    60.48%    +6.22%    27.85%    F^in_Spo = 69.65%

   Table 6: Experimental results on English corpus

     Source → Target   FBase     FLaSA     δ(F)      δ(loss)   F^in_dom
     Eco→Ent           60.45%    66.42%    +9.88%    26.29%    F^in_Ent = 83.16%
     Pol→Ent           69.89%    73.07%    +4.55%    23.96%    F^in_Ent = 83.16%
     Spo→Ent           68.66%    70.89%    +3.25%    15.38%    F^in_Ent = 83.16%
     Ent→Eco           58.50%    61.35%    +4.87%    11.98%    F^in_Eco = 82.28%
     Pol→Eco           62.89%    64.93%    +3.24%    10.52%    F^in_Eco = 82.28%
     Spo→Eco           60.44%    63.20%    +4.57%    12.64%    F^in_Eco = 82.28%
     Eco→Pol           67.03%    70.90%    +5.77%    27.78%    F^in_Pol = 80.96%
     Ent→Pol           66.64%    68.94%    +3.45%    16.06%    F^in_Pol = 80.96%
     Spo→Pol           65.40%    67.20%    +2.75%    11.57%    F^in_Pol = 80.96%
     Eco→Spo           67.20%    70.77%    +5.31%    15.47%    F^in_Spo = 90.24%
     Ent→Spo           70.05%    72.20%    +3.07%    10.64%    F^in_Spo = 90.24%
     Pol→Spo           70.99%    73.86%    +4.04%    14.91%    F^in_Spo = 90.24%

   Table 7: Experimental results on Chinese corpus

   Experimental results on the English and Chinese corpora indicate that the performance of Ms significantly degrades in each basic domain transfer without using the LaSA model (see Tables 6 and 7). For example, in the “Eco→Ent” transfer on the Chinese corpus (see Table 7), F^in_Eco of Ms is 82.28%, while FBase of Ms is 60.45% in the entertainment domain. The F-measure of Ms degrades by 21.83 percentage points in this basic transfer. Significant performance degradation of Ms is observed in all the basic transfers. This shows that the data distribution of the two domains is very different in each possible transfer.

   Experimental results on the English corpus show that LaSA-based adaptation effectively enhances the performance in each domain transfer (see Table 6). For example, in the “Pol→Eco” transfer, FBase is 63.62% while FLaSA achieves 68.10%. Compared with FBase, the LaSA-based method significantly enhances the F-measure by 7.04%. We perform t-tests on the F-measures of all the comparison experiments on the English corpus. The p-value is 2.44E-06, which shows that the improvement is statistically significant.

   Table 6 also gives the accuracy loss due to transfer in each domain adaptation on the English corpus. The accuracy loss is defined as loss = 1 − F / F^in_dom, and the relative reduction in error is defined as δ(loss) = |1 − loss_LaSA / loss_Base|. Experimental results indicate that the relative reduction in error is above 9.93% with LaSA-based transfer in each test on the English corpus. The LaSA model significantly decreases the accuracy loss by 29.38% on average. Especially for the “Spo→Pol” transfer, δ(loss) achieves 63.98% with LaSA-based adaptation. All the above results show that LaSA-based adaptation significantly reduces the accuracy loss in the domain transfer for English NER without any labeled target domain samples.
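   As a worked check of these definitions, the arithmetic below reproduces the “Pol→Eco” row of Table 6 (FBase = 63.62%, FLaSA = 68.10%, in-domain top line 80.39%); the helper names are ours.

    def relative_gain(f_lasa, f_base):
        # delta(F) = (F_LaSA - F_Base) / F_Base
        return (f_lasa - f_base) / f_base

    def error_reduction(f_lasa, f_base, f_in_domain):
        # delta(loss) = |1 - loss_LaSA / loss_Base|, with loss = 1 - F / F_in_domain
        loss_base = 1.0 - f_base / f_in_domain
        loss_lasa = 1.0 - f_lasa / f_in_domain
        return abs(1.0 - loss_lasa / loss_base)

    # "Pol -> Eco" on the English corpus (Table 6)
    print(round(100 * relative_gain(68.10, 63.62), 2))             # 7.04
    print(round(100 * error_reduction(68.10, 63.62, 80.39), 2))    # 26.71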

   Experimental results on the Chinese corpus also show that LaSA-based adaptation effectively increases the accuracy in all the tests (see Table 7). For example, in the “Eco→Ent” transfer, compared with FBase, LaSA-based adaptation significantly increases the F-measure by 9.88%. We also perform t-tests on the F-measures of the 12 comparison experiments on the Chinese corpus. The p-value is 1.99E-06, which shows that the enhancement is statistically significant. Moreover, the relative reduction in error is above 10% with the LaSA-based method in each test. The LaSA model decreases the accuracy loss by 16.43% on average. Especially for the “Eco→Ent” transfer (see Table 7), δ(loss) achieves 26.29% with the LaSA-based method.

   All the above experimental results on the English and Chinese corpora show that LaSA-based domain adaptation significantly decreases the accuracy loss in the transfer without any labeled target domain data. Although automatic tagging introduced some errors into the English source training data, the relative reduction in error in English NER adaptation seems comparable to that in Chinese NER adaptation.

5.3.2   Accuracy Enhancement for Each NE Type Recognition

   Our statistics (Guo et al., 2006) show that the distribution of NE types varies with domains. Each NE type has different domain features. Thus, the performance stability of each NE type recognition is very important in the domain transfer.

   Figure 2 gives the F-measure of each NE type recognition achieved by LaSA-based adaptation on the English and Chinese corpora. Experimental results show that LaSA-based adaptation effectively increases the accuracy of each NE type recognition in most of the domain transfer tests. We perform t-tests on the F-measures of the comparison experiments on each NE type, respectively. All the p-values are less than 0.01, which shows that the improvement on each NE type recognition is statistically significant. Especially, the p-values of English and Chinese PER are 2.44E-06 and 9.43E-05, respectively, which shows that the improvement on PER recognition is very significant. For example, in the “Eco→Pol” transfer on the Chinese corpus, compared with FBase, LaSA-based adaptation enhances the F-measure of PER recognition by 9.53 percentage points. The performance enhancement for ORG recognition is smaller than that for PER and LOC recognition using the LaSA model, since ORG NEs usually contain much more domain-specific information than PER and LOC.

   [Figure 2: PER, LOC and ORG recognition in the transfer]

   The major reason for the error reduction is that external context and internal units are better semantically associated using the LaSA model. For example, the LaSA model better groups various titles from different domains (see Table 1 in Section 4.2). Various industry terms in ORG NEs are also grouped into the semantic sets. These semantic associations provide useful hints for detecting the boundaries of NEs in the new target domain. All the above results show that the LaSA model better compensates for the feature distribution difference of each NE type across domains.

6   Conclusion

We present a domain adaptation method with the LaSA model in this paper. The LaSA model captures latent semantic association among words from the unlabeled corpus. It groups words into a set of concepts according to the related context snippets. The LaSA-based domain adaptation method projects words to a low-dimension concept feature space in the transfer. It effectively overcomes the data distribution gap across domains without using any labeled target domain data. Experimental results on English and Chinese corpora show that LaSA-based domain adaptation significantly enhances the performance of NER across domains. Especially, the LaSA model effectively increases the accuracy of each NE type recognition in the domain transfer. Moreover, the LaSA-based domain adaptation method works well across languages. To further reduce the accuracy loss, we will explore informative sampling to capture fine-grained data differences in the domain transfer.

References

Rie Ando and Tong Zhang. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6:1817–1853.

Andrew Arnold, Ramesh Nallapati, and William W. Cohen. 2008. Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL'08), pages 245–253.

David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 120–128.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), pages 440–447.

Andrew Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.

Yee Seng Chan and Hwee Tou Ng. 2007. Domain Adaptation with Active Learning for Word Sense Disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07).

Ciprian Chelba and Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22–29.

Hal Daume III. 2007. Frustratingly Easy Domain Adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.

Scott Deerwester, Susan T. Dumais, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the 2003 Conference on Computational Natural Language Learning.

Dayne Freitag. 2004. Trained Named Entity Recognition Using Distributional Clusters. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004).

Jianfeng Gao, Mu Li, Andi Wu, and Changning Huang. 2005. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Computational Linguistics, 31(4):531–574.

Honglei Guo, Jianmin Jiang, Gang Hu, and Tong Zhang. 2005. Chinese Named Entity Recognition Based on Multilevel Linguistic Features. Lecture Notes in Artificial Intelligence, 3248:90–99.

Honglei Guo, Li Zhang, and Zhong Su. 2006. Empirical Study on the Performance Stability of Named Entity Recognition Model across Domains. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 509–516.

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99).

Jing Jiang and ChengXiang Zhai. 2006. Exploiting Domain Structure for Named Entity Recognition. In Proceedings of HLT-NAACL 2006, pages 74–81.

Jing Jiang and ChengXiang Zhai. 2007. Instance Weighting for Domain Adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), pages 264–271.

Wei Li and Andrew McCallum. 2005. Semi-supervised sequence modeling with syntactic topic models. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05).

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 2004.

Alexander E. Richman and Patrick Schone. 2008. Mining Wiki Resources for Multilingual Named Entity Recognition. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics.

Brian Roark and Michiel Bacchiani. 2003. Supervised and unsupervised PCFG adaptation to novel domains. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 2003 Conference on Computational Natural Language Learning (CoNLL-2003), pages 142–147.

Xing Wei and Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International SIGIR Conference on Research and Development in Information Retrieval.

Tong Zhang, Fred Damerau, and David Johnson. 2002. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research, 2:615–637.

				