					       This article is published in Plant Physiology Online, Plant Physiology Preview Section, which publishes manuscripts accepted for
       publication after they have been edited and the authors have corrected proofs, but before the final, complete issue is published
       online. Early posting of articles reduces normal time to publication by several weeks.



A Complementary Bioinformatics Approach
to Identify Potential Plant Cell Wall
Glycosyltransferase-Encoding Genes1[w]

Jack Egelund, Michael Skjøt, Naomi Geshi, Peter Ulvskov, and Bent Larsen Petersen*
Biotechnology Group, Danish Institute of Agricultural Sciences, DK–1871 Frederiksberg C, Copenhagen,
Denmark; and Center for Molecular Plant Physiology (PlaCe), The Royal Veterinary and Agricultural
University, DK–1871 Frederiksberg C, Copenhagen, Denmark


Plant cell wall (CW) synthesizing enzymes can be divided into the glycan (i.e. cellulose and callose) synthases, which are
multimembrane spanning proteins located at the plasma membrane, and the glycosyltransferases (GTs), which are Golgi
localized single membrane spanning proteins, believed to participate in the synthesis of hemicellulose, pectin, mannans, and
various glycoproteins. In the Carbohydrate-Active enZYmes (CAZy) database, where e.g. glycoside hydrolases and GTs are
classified into gene families primarily based on amino acid sequence similarities, 415 Arabidopsis GTs have been classified into
GT families. Although much is known with regard to composition and fine structures of the plant CW, only a handful of CW
biosynthetic GT genes—all classified in the CAZy system—have been characterized. In an effort to identify CW GTs that have
not yet been classified in the CAZy database, a simple bioinformatics approach was adopted. First, the entire Arabidopsis
proteome was run through the Transmembrane Hidden Markov Model 2.0 server and proteins containing one or, more rarely,
two transmembrane domains within the N-terminal 150 amino acids were collected. Second, these sequences were submitted
to the SUPERFAMILY prediction server, and sequences that were predicted to belong to the superfamilies NDP-sugartransferase,
UDP-glycosyltransferase/glycogen phosphorylase, carbohydrate-binding domain, Gal-binding domain, or
Rossmann fold were collected, yielding a total of 191 sequences. Fifty-two accessions already classified in CAZy were discarded.
The resulting 139 sequences were then analyzed using the Three-Dimensional-Position-Specific Scoring Matrix and
mGenTHREADER servers, and 27 sequences with similarity to either the GT-A or the GT-B fold were obtained. Proof of
concept of the present approach has to some extent been provided by our recent demonstration that two members of this pool
of 27 non-CAZy-classified putative GTs are xylosyltransferases involved in synthesis of pectin rhamnogalacturonan II
(J. Egelund, B.L. Petersen, A. Faik, M.S. Motawia, C.E. Olsen, T. Ishii, H. Clausen, P. Ulvskov, and N. Geshi, unpublished data).


  1 This work was supported by the Danish National Research Foundation and The Danish Research Agency.
  * Corresponding author; e-mail b.petersen@dias.kvl.dk; fax 45–35282589.
  [w] The online version of this article contains Web-only data.
  Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.104.042978.
  Plant Physiology, September 2004, Vol. 136, pp. 1–12, www.plantphysiol.org © 2004 American Society of Plant Biologists

   The plant cell wall (CW) consists of four major polysaccharide components, namely cellulose, callose,
hemicellulose, and pectin. CW synthesis/formation can be divided into three major steps. (1) Initially, the
various building blocks in the form of activated glycosyl residues (NDP-sugars) are synthesized via two
different pathways: the nucleotide interconversion pathway or the salvage pathway (for overview, see
Carpita, 1996). The synthesis of the NDP-sugars may occur in the cytosol and/or the Golgi apparatus
depending on the type of NDP-sugar synthesized (Mohnen, 1999). (2) The synthesized nucleotide sugars are
then assembled into higher-order polysaccharide structures. Apart from cellulose and callose, biosynthesis
of CW polysaccharides occurs in the endomembrane system (Bolwell and Northcote, 1983; Zhang and Staehelin,
1992; Sherrier and VandenBosch, 1994), from which the polysaccharides are secreted into the wall where they
undergo further modifications (Fry, 1995). (3) The final step, which constitutes the assembly of the various
polysaccharide structures into the wall, remains in large part a mystery. However, self-assembly of wall
components most likely plays a role (for discussion of a possible mechanism, see MacDougal et al., 1997),
and both enzymatic and nonenzymatic mechanisms as well as arabinogalactan proteins and other wall
structural proteins (Cosgrove, 1997) participate in the complex process.
   The noncellulosic polymers hemicellulose and pectin are synthesized by glycosyltransferases (GTs)
presumably located in the different compartments of the Golgi apparatus. These GTs are believed to be type II
membrane-bound proteins with the catalytic domain (C-terminal) facing the lumen of the Golgi apparatus
(Ridley et al., 2001; Sterling et al., 2001; Geshi et al., 2004).
   Although the GTs, for which the three-dimensional (3D) structures have been resolved, exhibit
insignificant or at the best very low sequence similarity, they adopt one of the following folds at the
3D-structure level: the GT-A (SpsA and SpsA-like) fold or the GT-B (B-GT and B-GT-like) fold (Bourne and
Henrissat, 2001; Hu and Walker, 2002; Coutinho et al., 2003; Wimmerova et al., 2003).
   The Carbohydrate-Active enZYme (CAZy; http://afmb.cnrs-mrs.fr/CAZY/) database is a versatile and
comprehensive sequence-based database of carbohydrate-active enzymes, where e.g. glycoside hydrolases and
GTs are classified into families primarily based on amino acid sequence similarities (Henrissat et al., 2001).
Within a given family, the 3D structure is conserved, i.e. the same 3D fold is expected to occur in each
family (Coutinho et al., 2003).
   Although composition of the major CW polysaccharides is reasonably well described (Carpita et al.,
2001), only a handful of the biosynthetic genes have been identified. All of the seven known GTs, i.e. with
proven or putative function in mannan (Edwards et al., 1999), hemicellulose (Perrin et al., 1999; Faik et al.,
2002; Madson et al., 2003), and pectin synthesis (Bouton et al., 2002; Iwai et al., 2002; J. Sterling and
D. Mohnen, personal communication), are classified in the CAZy database. In this study, we have set up an
alternative bioinformatics scheme aimed at identifying CW GTs with a predicted type II membrane topology,
which are not classified in the CAZy database. Using this alternative approach, 27 non-CAZy classified
accessions with a predicted N-terminal transmembrane domain (TMD) typical of type II membrane
proteins and that were predicted to adopt the GT-A or the GT-B fold were identified.

Figure 1. Flow chart of the bioinformatics approach used to identify 27 putative GTs not classified in the
CAZy database. Web sites for the various servers used in this study are listed in “Materials and Methods.”



RESULTS

   In an effort to obtain GTs with a type II membrane protein topology, which have not been classified in the
CAZy database, the following simple bioinformatics approach was adopted (for overview, see Fig. 1).
   First, using the Transmembrane Hidden Markov Model (TMHMM) 2.0 prediction server, the entire
Arabidopsis proteome (26,095 proteins) was scanned for the presence of transmembrane helices, yielding
a total of 5,977 sequences with any number of predicted transmembrane helices. Within this pool, potential
type II membrane proteins with either one or, in rare cases, two (derived from the predicted transmembrane
helix and a hydrophobic signal peptide) predicted TMDs, which resided within the first 150 amino acids
from the N terminus, were identified and extracted, yielding a total of 2,248 and 363 accessions,
respectively.
   The 2,248 plus 363 sequences were then submitted to the SUPERFAMILY prediction server, and 191
sequences predicted, irrespective of E-value, to belong to the superfamilies NDP-sugartransferase (54),
UDP-glycosyltransferase/glycogen phosphorylase (33), Gal-binding domain (23), carbohydrate-binding domain
(6), or the GT-B-similar Rossmann fold (75) were collected. The 191 sequences were then blasted against the
CAZy database (September 10, 2003), and sequences found in the CAZy database were removed from the dataset,
leaving a total of 139 sequences (24, 25, 12, 5, and 73 from the 5 superfamilies, respectively), which were
not classified in the CAZy database. The 139 sequences were then run through the mGenTHREADER and
3D-Position-Specific Scoring Matrix (3D-PSSM) servers, respectively. A local set of protein IDs (Protein
Data Bank [PDB]) of proteins, whose 3D structures have been resolved and which adopt either the GT-A or the
GT-B fold (Table I; references for the PDB IDs can be retrieved at http://www.RCSB.ORG/), and resolved 3D
structures derived from the CAZy database GT families were used to validate the output from each of the two
servers. Twenty-seven of the 139 sequences (Table II) displayed similarity to one or more of the entries in
Table I, i.e. the proteins predicted to adopt either the GT-A or the GT-B fold. Recently, two highly
identical accessions (Q9ZSJ2 and Q9ZSJ0; Table II; Fig. 2B) were shown to be CW-specific xylosyltransferases
(J. Egelund, B.L. Petersen, A. Faik, M.S. Motawia, C.E. Olsen, T. Ishii, H. Clausen, P. Ulvskov, and
N. Geshi, unpublished data), corroborating that the adopted bioinformatics strategy identifies GTs related to
CW biosynthesis.

Filtering of the Arabidopsis Proteome

   Choice of servers, strategies applied, and theoretical and practical considerations of the filtering
process are described sequentially below.
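
   As a minimal sketch of the first selection step, assuming the helix predictions have already been parsed
into (start, end) residue ranges, the snippet below sorts accessions into a one-TMD and a two-TMD pool. The
function names, the data layout, and the rule that every helix must end at or before residue 150 are
illustrative assumptions; the text specifies only that the TMDs reside within the first 150 amino acids. The
two real accessions and their TMD ranges are taken from Table II; the other two entries are invented.

```python
from typing import Dict, List, Tuple

Helix = Tuple[int, int]  # (start, end) residue positions, 1-based, inclusive

N_TERMINAL_WINDOW = 150  # "within the first 150 amino acids from the N terminus"


def classify_candidate(helices: List[Helix], window: int = N_TERMINAL_WINDOW) -> str:
    """Return '1TMD', '2TMD', or 'rejected' for one protein.

    Assumption: a protein qualifies only if it has exactly one or two
    predicted helices and each of them ends at or before `window`.
    """
    if len(helices) not in (1, 2):
        return "rejected"
    if all(end <= window for _start, end in helices):
        return f"{len(helices)}TMD"
    return "rejected"


def split_pools(predictions: Dict[str, List[Helix]]) -> Dict[str, List[str]]:
    """Group accepted accessions by their number of N-terminal TMDs."""
    pools: Dict[str, List[str]] = {"1TMD": [], "2TMD": []}
    for accession, helices in predictions.items():
        label = classify_candidate(helices)
        if label in pools:
            pools[label].append(accession)
    return pools


example = {
    "Q9LZ77": [(51, 70)],                                        # Table II, one TMD
    "Q9XEE9": [(4, 26), (113, 135)],                             # Table II, two TMDs
    "hypothetical_multispan": [(10, 32), (60, 82), (200, 222)],  # invented
    "hypothetical_late_helix": [(300, 322)],                     # invented
}
pools = split_pools(example)
print(pools)
```

   Keeping the one-TMD and two-TMD pools separate mirrors the way the two groups are reported separately in
the text and in Table II.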



Table I. List of PDB IDs used to screen the result of the secondary structure prediction servers mGenTHREADER and 3D-PSSM
  The PDB IDs were obtained from Wimmerova et al. (2003; lowercase) as well as manually from the CAZy database GT families (uppercase; only
one PDB ID per family). Origins and functions of the enzymes were obtained from the PDB (http://www.pdb.mdc-berlin.de/pdb/; Berman et al., 2000).

    PDB ID      Origin                      Enzyme                                                                  Function
   GT-A
    1ABB        Rabbit                      Glycogen phosphorylase                                                  Glycogen biosynthesis
    1EM6        Human                       Glycogen phosphorylase                                                  Glycogen biosynthesis
    1eyr        N. meningitidis             Selenomethionyl cytosine-5'-monophosphate-acylneuraminate synthetase    Activation of sialic acid
    1fgg        Human                       Glucuronyltransferase I                                                 Heparan/chondroitin sulfate biosynthesis
    1foa        Rabbit                      N-Acetylglucosaminyltransferase I                                       Decoration of glycoproteins
    1fr8        Bovine                      β-1,4-Galactosyltransferase                                             Decoration of glycoproteins
    1frw        E. coli                     MobA                                                                    Molybdopterin guanine dinucleotide biosynthesis
    1g0r        Pseudomonas aeruginosa      Glucose-1-phosphate thymidylyltransferase                               Bacterial cell wall biosynthesis
    1g93        Bovine                      α-1,3-Galactosyltransferase                                             Decoration of glycoproteins
    1g97        Streptococcus pneumoniae    N-Acetylglucosamine-1-phosphate uridyltransferase                       Synthesis of UDP-N-acetylglucosamine
    1ga8        N. meningitidis             Lipopolysaccharide galactosyltransferase                                Lipooligosaccharide biosynthesis
    1GA8        N. meningitidis             Lipopolysaccharide galactosyltransferase implicated in                  Lipooligosaccharide biosynthesis
    1GZ5        E. coli                     Trehalose phosphate synthase                                            Trehalose biosynthesis
    1h7g        E. coli                     3-Deoxy-manno-octulosonate cytidylyltransferase                         Lipopolysaccharide biosynthesis
    1ini        E. coli                     4-Diphosphocytidyl-2-C-methylerythritol                                 Isoprenoid biosynthesis
    1j94        Mouse                       β-1,4-Galactosyltransferase                                             Lactose biosynthesis
    1ll2        Rabbit                      Glycogenin glucosyltransferase                                          Glycogen biosynthesis
    1lz0        Human                       Glycosyltransferase A                                                   Blood group biosynthesis
    1LZJ        Human                       α-1,3-Galactosyltransferase                                             Blood group biosynthesis
    1OMX        Mouse                       α-1,4-N-Acetylhexosaminyltransferase                                    Heparan sulfate biosynthesis
    1qgq        Bacillus subtilis           NDP-sugartransferase                                                    Synthesis of spore coat
    1YGP        Saccharomyces cerevisiae    Glycogen phosphorylase                                                  Glycogen biosynthesis
   GT-B
    1BGT        Bacteriophage T4            β-Glucosyltransferase                                                   Nucleotide synthesis
    1c3j        Bacteriophage T4            β-Glucosyltransferase                                                   Nucleotide synthesis
    1f0k        E. coli                     Pyrophosphoryl-undecaprenol N-acetylglucosamine transferase             Peptidoglycan biosynthesis
    1f6d        E. coli                     UDP-N-acetylglucosamine 2-epimerase                                     UDP-N-acetylglucosamine biosynthesis
    1FGG        Human                       Glucuronyltransferase I                                                 Heparan/chondroitin sulfate synthesis
    1FGX        Bovine                      β-1,4-Galactosyltransferase                                             Glycoprotein and glycosphingolipid synthesis
    1FO9        Rabbit                      N-Acetylglucosaminyltransferase I                                       Decoration of glycoproteins
    1h5u        Rabbit                      Glycogen phosphorylase                                                  Glycogen biosynthesis
    1iir        Amycolatopsis orientalis    UDP-glucosyltransferase Gtfb                                            Synthesis of the Vancomycin group of antibiotics
    1NLM        E. coli                     Pyrophosphoryl-undecaprenol N-acetylglucosamine transferase             Bacterial cell wall synthesis
    1PN3        A. orientalis               TDP-Epi-Vancosaminyltransferase Gtfa                                    Synthesis of the Vancomycin group of antibiotics
    1QG8        B. subtilis                 Nucleotide-diphospho-sugartransferase                                   Spore coat synthesis
    1QKJ        Bacteriophage T4            β-Glucosyltransferase                                                   Nucleotide synthesis
    1qm5        E. coli                     Maltodextrin phosphorylase                                              Phosphorolysis of maltodextrin
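
   The screening of server output against this reference list could look roughly like the sketch below. It
is not the authors' procedure: the hit format, the function names, and the choice to keep the lowest-E-value
reference hit are assumptions made for illustration, and only a small subset of Table I is included.
Case-insensitive matching is also an assumption; Table I uses letter case to mark the source of each ID, and
at least one ID appears under both folds when case is ignored, so a full implementation would need to decide
how to treat such entries.

```python
from typing import Iterable, Optional, Set, Tuple

# Illustrative subset of Table I (not the full reference list).
GT_A_IDS: Set[str] = {"1qgq", "1foa", "1ll2"}
GT_B_IDS: Set[str] = {"1iir", "1f0k", "1c3j"}


def normalize(pdb_id: str) -> str:
    """Lower-case and trim a PDB ID so that '1QGQ' and '1qgq' compare equal."""
    return pdb_id.strip().lower()


def best_gt_fold(
    hits: Iterable[Tuple[str, float]],
    gt_a: Set[str] = GT_A_IDS,
    gt_b: Set[str] = GT_B_IDS,
) -> Optional[Tuple[str, str, float]]:
    """Return (fold, pdb_id, e_value) for the reference hit with the lowest
    E-value, or None if no hit matches a reference structure."""
    best: Optional[Tuple[str, str, float]] = None
    for pdb_id, e_value in hits:
        key = normalize(pdb_id)
        if key in gt_a:
            fold = "GT-A"
        elif key in gt_b:
            fold = "GT-B"
        else:
            continue  # hit is not one of the reference GT structures
        if best is None or e_value < best[2]:
            best = (fold, key, e_value)
    return best


# Hypothetical server output for one query: (PDB ID of template, E-value).
hits = [("9zzz", 0.001), ("1QGQ", 0.02), ("1IIR", 0.5)]
result = best_gt_fold(hits)
print(result)
```

   In this invented example the best-scoring template overall is ignored because it is not a reference GT
structure, and the query is assigned to the fold of the best-scoring reference hit.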







Table II. The 27 putative GTs identified as a result of filtering the Arabidopsis proteome through the TMHMM, SUPERFAMILY, mGenTHREADER, and
3D-PSSM servers as illustrated in Figure 1
  Web sites for the various servers used in this study are listed in “Materials and Methods.”

  TrEMBL          SuperF(a)   Best Fit to         E-Value       E-Value          BLAST (NCBI)          TMD
  Protein ID                  Known GT Fold(b)    3D-PSSM       mGenTHREADER

  Q9LZ77          U           GT-B                0.0017        0.001            Plant and bacteria    51-70
  Q9M147          U           GT-B                0.0112        0.001            Plant and bacteria    44-66
  Q9FMW3          U           GT-B                Not found     0.023            Plant                 53-72
  Q9LU22          U           GT-B                Not found     0.068            Plant and animal      27-49
  Q9C9Z9          N           GT-B                Not found     0.022            Plant and animal      27-49
  O81786          R           GT-B                Not found     0.979            Plant and animal      45-62
  Q9C920          N           GT-A                3.56e-08      0.005            Plant                 13-35
  Q9LTZ5          N           GT-A                8.51e-05      0.004            Plant and bacteria    21-42
  Q9FM26          N           GT-A                0.0173        0.005            Plant and bacteria    21-43
  O04568          N           GT-A                0.00930       0.009            Plant and bacteria    22-44
  Q9FXA7          N           GT-A                0.606         0.112            Plant                 21-40
  Q9C9Q6          N           GT-A                0.063         0.017            Plant                 13-35
  Q9C9Q5          N           GT-A                0.0993        0.022            Plant                 13-35
  Q9FMN8          N           GT-A                0.115         0.015            Plant                 26-45
  Q9ZSJ2          N           GT-A                0.271         0.037            Plant                 36-55
  Q9FF50          N           GT-A                0.278         0.013            Plant                 38-60
  Q9M146          N           GT-A                0.0355        0.059            Plant                 42-61
  Q9SZU2          N           GT-A                0.0355        0.012            Plant                 44-66
  Q9ZSJ0          N           GT-A                0.174         0.111            Plant                 30-52
  Q9SAD6          N           GT-A                Not found     0.046            Plant                 20-42
  Q9LKU7          R           GT-A                Not found     1.030            Plant                 23-45
  Q9LQS0          R           GT-A                Not found     0.774            Plant                 45-67
  Q9LYF7          R           GT-A                4.11          Not found        Plant                 30-52
  Q9LU27          R           GT-A                3.51          Not found        Plant                 31-53
  Q9T0G0          R           GT-A                0.660         Not found        Plant and bacteria    7-29
  2TMD Proteins
  Q9XEE9          U           GT-A                2.73e-05      0.0008           Plant and animal      4-26 and 113-135
  Q9ZU10          R           GT-B                Not found     0.551            Plant and animal      5-27 and 47-69

  TrEMBL          Length           SignalP           Pfam(c)                   Pfam(c)         DxD in Hydrophobic   Isoxaben Array        EST
  Protein ID      (Amino Acids)                      Domain                    E-Value         Pocket(d)            Up/Down-Regulated

  Q9LZ77          1091             Nonsecretory      GT1                       0.00017         Yes                  +37%                  No
  Q9M147          963              Signal anchor     GT1                       3.9 × 10^-7     Yes                  +35%                  Yes
  Q9FMW3          559              Signal anchor     None                      Not found       Not found            -22%                  No
  Q9LU22          419              Signal anchor     None                      Not found       Yes                  -2%                   Yes
  Q9C9Z9          533              Signal peptide    None                      Not found       Not found            -4%                   Yes
  O81786          204              Signal anchor     None                      Not found       Not found            -18%                  Yes
  Q9C920          290              Signal peptide    CTP-GT                    2.9 × 10^-63    Yes                  +52%                  Yes
  Q9LTZ5          582              Signal anchor     GT2                       0.0016          Yes                  +483%                 Yes
  Q9FM26          583              Signal anchor     GT2                       0.0088          Yes                  Not found             Yes
  O04568          516              Signal anchor     None                      Not found       Yes                  -38%                  Yes
  Q9FXA7          383              Signal anchor     None                      Not found       Yes                  Not found             Yes
  Q9C9Q6          402              Signal anchor     GT8                       0.017           Yes                  Not found             Yes
  Q9C9Q5          428              Signal anchor     GT8                       0.09            Yes                  -1%                   Yes
  Q9FMN8          624              Signal anchor     None                      Not found       Yes                  +44%                  Yes
  Q9ZSJ2          361              Signal anchor     None                      Not found       Yes                  Not found             No
  Q9FF50          932              Signal anchor     GT2                       Not found       Yes                  Not found             Yes
  Q9M146          360              Signal anchor     None                      Not found       Yes                  +59%                  Yes
  Q9SZU2          588              Signal anchor     GT2                       0.011           Yes                  Not found             No
  Q9ZSJ0          367              Signal anchor     GT2                       0.011           Yes                  -64%                  Yes
  Q9SAD6          371              Signal anchor     Chemotaxis phosphatase    0.075           Yes                  -62%                  Yes
  Q9LKU7          156              Nonsecretory      Zinc finger domain        0.00044         Not found            -25%                  No
  Q9LQS0          118              Nonsecretory      None                      Not found       Not found            Not found             No
  Q9LYF7          386              Signal anchor     None                      Not found       Not found            -20%                  Yes
  Q9LU27          384              Signal anchor     None                      Not found       Yes                  -28%                  No
  Q9T0G0          389              Signal peptide    Dehydrogenase             3.7 × 10^-21    Not found            +2%                   Yes
  2TMD Proteins
  Q9XEE9          474              Signal peptide    GT1                       2.1 × 10^-19    Yes                  -22%                  Yes
  Q9ZU10          200              Signal peptide    None                      Not found       Not found            Not found             Yes

  (a) N, NDP-sugartransferases; R, NAD(P)-binding Rossmann-fold domains; U, UDP-glycosyltransferase/glycogen phosphorylase; G, galactose-binding domain-like.
  (b) 3D-PSSM and/or mGenTHREADER.
  (c) Hits only shown for E-values < 0.1.
  (d) HCA analysis.
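
   When the E-value columns of Table II are processed programmatically, the “Not found” entries need to be
distinguished from numeric values. The following is a small sketch under that assumption, using three rows
copied from the table; the helper names are hypothetical, and accepting a decimal comma is simply a
defensive choice.

```python
from typing import Dict, List, Optional, Tuple


def parse_e_value(cell: str) -> Optional[float]:
    """Convert a table cell to a float, or None when the server found no hit."""
    text = cell.strip().replace(",", ".")  # tolerate a decimal comma
    if text.lower() == "not found":
        return None
    return float(text)


def supporting_servers(pssm_cell: str, threader_cell: str) -> List[str]:
    """List the fold-recognition servers that returned a hit for an accession."""
    support: List[str] = []
    if parse_e_value(pssm_cell) is not None:
        support.append("3D-PSSM")
    if parse_e_value(threader_cell) is not None:
        support.append("mGenTHREADER")
    return support


# (3D-PSSM E-value, mGenTHREADER E-value) as printed in Table II.
rows: Dict[str, Tuple[str, str]] = {
    "Q9LZ77": ("0.0017", "0.001"),
    "Q9FMW3": ("Not found", "0.023"),
    "Q9LYF7": ("4.11", "Not found"),
}
summary = {acc: supporting_servers(*cells) for acc, cells in rows.items()}
print(summary)
```

   Representing a missing hit as `None` rather than as a large number keeps "no hit" separate from "weak
hit", which matters here because several accessions in the table are supported by only one of the two servers.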








         Figure 2. Phylogenetic tree of the 27 putative GTs. Four distinct homologous groups (A–D) consisting of two to six sequences
         were identified in the analysis.



Filter I: Identification of Potential Type II Membrane Proteins

   In two comparative tests, TMHMM 2.0 (Krogh et al., 2001) was found to be the best of the tested prediction
servers measured as having the lowest fraction of the sum of false positives and false negatives within the
total number of the experimentally assigned transmembrane helix (TMH) segments ([false positives +
false negatives]/no. of TMH) used in the tests (Schwacke et al., 2003; Zhou and Zhou, 2003).
TMHMM 2.0 was chosen as the initial filter because of its reliable and somewhat conservative prediction
strategy and because this server supports batch submissions of up to 4,000 accessions.

Filters II and III: Identification of Accessions in GT-Containing Superfamilies

   The SUPERFAMILY database server was chosen as the next filter because this facility incorporates an
alternative approach to the seed-based PSI-BLAST approach used in the CAZy database classification
scheme and supports batch submissions of up to 20 accessions. The SUPERFAMILY database contains a
library of hidden Markov models (HMMs) representing all proteins of known structure (Gough et al., 2001;
Gough and Chothia, 2002). The SUPERFAMILY facility is based on the Structural Classification of Proteins
(SCOP) protein domain classification database, which in turn is based on multiple sequence alignments
designed to represent a protein family in a structural domain-based hierarchical classification scheme with
several levels, including the superfamily level (Murzin et al., 1995).

Filter IV: Identification of Putative GTs within GT-Containing Superfamilies

   mGenTHREADER is based upon a multilayered neural network that was trained to combine sequence
alignment score, length information, and energy potentials with PSI-BLAST searches, which have been
jumpstarted with structural alignment profiles from Fold Secondary Structure Prediction and Predict
Secondary Structure (PSIPRED), predicted secondary structure, and bidirectional scoring in order to
Plant Physiol. Vol. 136, 2004                                                                                                           5 of 12
Egelund et al.


calculate the final alignment score (Jones, 1999; McGuffin and Jones, 2003).
   3D-PSSM constitutes a method for protein fold recognition using one-dimensional (1D) and 3D sequence profiles coupled with secondary structure and solvation potential information (Kelley et al., 2000).
   The output of the sequence-based SUPERFAMILY server was evaluated by the sophisticated mGenTHREADER (multilayered neural network) and 3D-PSSM servers, which operate at the fold level and, in addition to 1D sequence information, incorporate 3D structural information, solvation potential, etc. (see also above). The difference in the number of accessions pre-filter IV and post-filter IV (139 and 27, respectively) indicates that a major fraction of the 139 accessions predicted by the SUPERFAMILY server to belong to polysaccharide or CW relevant superfamilies were most likely non-GT proteins, as evidenced e.g. by accessions containing an unusually high number of Pro and Ser residues (>50% of the total amino acid residues) or by proteins with an estimated molecular mass <20 kD. The 139 non-CAZy classified accessions resulting from the SUPERFAMILY filtering and BLAST searches against the local CAZy database are available as supplemental data (available at www.plantphysiol.org).

Elimination of False, For Example, Non-GT, Hits

   The ability of the filtering process to eliminate accessions that encode enzymes that have NDP-sugars as substrate but are not GTs was examined by applying the sequential filtering procedure to two quite different proteins: (1) a putative membrane-bound Arabidopsis protein involved in synthesis of UDP-D-Xyl, which in plants is incorporated in glycoproteins and CW polysaccharides, including xyloglucan (XG), and (2) an Escherichia coli protein catalyzing the epimerization of UDP-N-acetyl-D-glucosamine to UDP-N-acetyl-D-mannosamine, involved in bacterial lipopolysaccharide biosynthesis. The Arabidopsis UDP-glucuronic acid decarboxylase (ATUXS2, At3g62830) is predicted to adopt a typical type II membrane protein topology and is thought to be located in the Golgi apparatus (Harper and Bar-Peled, 2002). When used as a negative control, ATUXS2 passes filters II-III (Rossmann fold, non-CAZy entry) but fails to pass filter IV, i.e. ATUXS2 does not adopt a GT-A or a GT-B fold structure. Furthermore, a DxD motif (as described below) is not found in ATUXS2. Whereas the E. coli UDP-N-acetyl-D-glucosamine 2-epimerase (Kiino et al., 1993; P27828) as expected does not adopt a typical type II membrane protein structure when run through the TMHMM version 2.0 server (filter I), it is predicted to belong to the UDP-glycosyltransferase/glycogen phosphorylase superfamily by the SUPERFAMILY prediction server and is predicted to adopt a GT-B fold by mGenTHREADER. However, a DxD motif as described below is not found.
   When the six known plant CW GTs were run through the 3D-PSSM and mGenTHREADER servers, the galactomannan-specific α(1-6)galactosyltransferase and the XG-specific α(1-6)xylosyltransferase were predicted to adopt the GT-B and the GT-A fold, respectively, although both proteins are classified in CAZy family GT34 (Table III). However, as indicated

Table III. Known type II CW GTs and their classification in the CAZy database
   The glycosyltransferases listed are all from Arabidopsis. From the top, α(1-6)-D-xylose transferase, transfers D-xylose onto the β(1-4)glucan chains of xyloglucan (Faik et al., 2002); α(1-6)-D-galactose transferase, transfers D-galactose onto the β(1-4)mannan backbone of galactomannan (Edwards et al., 1999); α(1-2)-L-fucose transferase, transfers the terminal L-fucose onto the galactosyl residue of the xyloglucan sidechain (Perrin et al., 1999); Quasimodo, involved in pectin biosynthesis* (Bouton et al., 2002); β(1-2)-D-galactose transferase, transfers D-galactose onto the α(1,6)-linked xylose in xyloglucan (Madson et al., 2003); β(1-4)-D-glucuronosyl transferase, transfers D-glucuronic acid onto α(1-4)-linked fucose in RG II (Iwai et al., 2002)*.

   GT Function      TrEMBL Protein ID   CAZy Family   SuperF(a)   Best Fit to Known GT Fold(b)   E-Value, 3D-PSSM   E-Value, mGenTHREADER   Blast (NCBI)
   α(1-6)-D-xylT    Q9LZJ3              GT-34         None        GT-B                           5.08               0.187                   Plant and bacteria
   α(1-6)-D-galT    Q9ST56              GT-34         None        GT-A                           0.934              0.152                   Plant and bacteria
   α(1-2)-L-fucT    Q9LJK1              GT-37         None        GT-B                           4.54               Not found               Plant and animal
   Quasimodo        Q9LSG3              GT-8          N           GT-A                           1.08 × 10^-12      0.009                   Plant and animal
   β(1-2)-D-galT    Q9LVB4              GT-47         N           None                           Not found          Not found               Plant
   β(1-4)-D-glcAT   Q8GSQ4              GT-47         N           GT-B                           5.04               0.090                   Plant and animal

   GT Function      TMD     Length (Amino Acids)   SignalP         Pfam(c) Domain      Pfam(c) E-Value   DxD in Hydrophobic Pocket(d)   Isoxaben Array
   α(1-6)-D-xylT    21-40   460                    Signal anchor   GT-34               1.1 × 10^-121     Yes                            Down 38%
   α(1-6)-D-galT    13-35   435                    Signal anchor   GT-34               1.0 × 10^-137     Yes                            Not found
   α(1-2)-L-fucT    None    501                    Signal anchor   GT-10               1.6 × 10^-17      Yes                            Not found
   Quasimodo        21-43   599                    Signal anchor   GT-8                1.4 × 10^-116     Yes                            Up 118%
   β(1-2)-D-galT    None    549                    Nonsecretory    Exostosin (GT-47)   1.2 × 10^-92      Not found                      Down 55%
   β(1-4)-D-glcAT   None    334                    Nonsecretory    Exostosin (GT-47)   2.3 × 10^-31      Yes                            Down 42%

   *Function likely; activity not unequivocally demonstrated. (a) N, NDP-sugartransferases. (b) 3D-PSSM and/or mGenTHREADER. (c) Hits only shown for E-values < 0.1. (d) HCA analysis (data not shown).
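
The sequential filtering applied to the two negative controls can be sketched as a chain of pass/fail checks that reports the first filter an accession fails. This is only an illustration: the `Prediction` class, its boolean fields, and `first_failed_filter` are hypothetical names, and each server outcome is reduced to a single boolean taken from the description in the text, not from raw server output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Server outcomes for one accession, reduced to booleans (illustrative)."""
    name: str
    type_ii_topology: bool             # filter I: TMHMM 2.0 predicts a type II topology
    passes_superfamily_filters: bool   # filters II-III: SUPERFAMILY / non-CAZy screening
    gt_a_or_gt_b_fold: bool            # filter IV: fold servers report a GT-A or GT-B fold


# Filters are applied in order; the first failing one eliminates the accession.
FILTERS = (
    ("I", lambda p: p.type_ii_topology),
    ("II-III", lambda p: p.passes_superfamily_filters),
    ("IV", lambda p: p.gt_a_or_gt_b_fold),
)


def first_failed_filter(p: Prediction) -> Optional[str]:
    """Return the label of the first filter that p fails, or None if it passes all."""
    for label, passes in FILTERS:
        if not passes(p):
            return label
    return None


# The two controls as described in the text.
atuxs2 = Prediction("ATUXS2", True, True, False)
epimerase = Prediction("E. coli UDP-GlcNAc 2-epimerase", False, True, True)

for control in (atuxs2, epimerase):
    print(control.name, "eliminated at filter", first_failed_filter(control))
```

Ordering the checks this way mirrors the point made by the controls: an accession is removed by the earliest filter it fails, regardless of how later servers would have classified it.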



by the poor E-values, discrimination between the GT-A and GT-B fold was not feasible. In this respect it should be noted that plants in general synthesize a number of plant-specific CW polymers (not found in any other kingdom). The uniqueness of such structures may be reflected in the structure of the biosynthetic enzymes, and these may thus not be clearly related to GTs of organisms from other kingdoms. None of the few characterized plant CW GTs have had their 3D structure resolved. It is in this context that servers like mGenTHREADER and 3D-PSSM, which besides sequence similarity use various parameters such as 3D information (see above) in their prediction strategy, were chosen as validation tools. The prediction ability of the various servers will undoubtedly improve as new plant CW GTs are identified and structurally analyzed. In summary, the filtering process applied here was quite efficient in eliminating evident types of false positives. The pool of 27 accessions may still comprise non-GT accessions and was thus subjected to a post-filtering analysis.

Post-Filtering Evaluation of the 27 Putative GTs

Homologous Sequences within and outside of the Plant Kingdom

   The presence of homologous sequences was investigated by subjecting the putative GTs to global protein-protein BLAST (blastp). The majority of the searches (22 out of 27) gave rise to plant-specific or plant- and bacteria-specific hits (Table II). Because mycobacterial CWs, for example, contain plant CW-like polysaccharides such as arabinogalactans (Crick et al., 2001), bacterial hits do not necessarily contradict a function in plant CW synthesis. When the 27 sequences were blasted against the Arabidopsis expressed sequence tag (EST) database, 21 were represented by an EST (Table II).

Phylogeny

   As a consequence of the overall bioinformatics approach adopted in this study, significant similarity throughout the 27 accessions was not anticipated. Alignment of the 27 putative GTs, however, identified four groups of clustered homologous genes, denominated A, B, C, and D, of which groups B and C (Fig. 2) display a high degree of conservation in stretches of at least 20 to 80 amino acids (data not shown). The rest of the 27 accessions constituted a heterogeneous group with extremely low or insignificant sequence identity.
   The genes in group B (Q9C9Q6, Q9C9Q5, Q9FXA7, Q9M146, Q9ZSJ2, and Q9ZSJ0; Table II; Fig. 2) fall into two distinct subgroups consisting of highly identical group members (subgroup I: four sequences with 73%, 75%, and 90% identity; subgroup II: two sequences with 72% identity) but with only 11% identity between the two subgroups. The two highly identical accessions (Q9ZSJ2 and Q9ZSJ0; Table II; Fig. 2B) are the xylosyltransferases mentioned above. Genes in group C are approximately 550 amino acids long. Aside from the four GTs (accession nos. Q9LZ77, Q9M147, Q9C920, and Q9XEE9; Table II; Fig. 2C), which display significant similarity to CAZy GT-family-1 and CTP-GTs, similarity for the rest of the 27 sequences to other GTs with known function (plant or non-plant) was extremely weak or nonexistent.

Prediction of Subcellular Localization

   For 24 of the 27 putative GTs, the SignalP server predicted a signal anchor or signal peptide in or close to the TMD (data not shown; Table II). When the six GTs of group B (Fig. 2) were run through the TargetP server (Krogh et al., 2001), a reliable prediction of their subcellular location could not be achieved. Similar results were obtained when the six GTs with known function in CW synthesis (Edwards et al., 1999; Perrin et al., 1999; Bouton et al., 2002; Faik et al., 2002; Iwai et al., 2002; Madson et al., 2003; Table III) were run through these servers (data not shown). Although localization of these CW GTs has not been proven, sufficient evidence exists for Golgi as the place of synthesis of at least the major building blocks of the CW (for review, see Mohnen, 1999). Thus, the TargetP prediction server is not able to generate a reliable prediction of the localization of the plant CW GTs.

Expression Data

   Recently, sets of CW-specific microarray data derived from suspension cultured cells, which had been exposed to the herbicide N-[3-(1-ethyl-1-methylpropyl)-1,2-oxazol-5-yl]-2,6-dimethoxybenzamide (isoxaben), have been made public (see "Materials and Methods"). Isoxaben specifically inhibits cellulose synthesis (Scheible et al., 2001), and plants adapted to grow in isoxaben compensate for the almost complete loss of the cellulose-XG load bearing structure by construction of walls made predominantly of pectin (Schedletzky et al., 1990; Encina et al., 2002). The array-derived expression data for the accessions Q9C920, Q9LTZ5 (and the highly homologous Q9FM26 not present in the array dataset), and Q9M146, which in this experiment are up-regulated 52%, 483%, and 59%, respectively, might indicate function in pectin biosynthesis. However, the lack of confirmed pectin biosynthetic GTs and the (spatial and temporal) expression pattern of each putative GT should be taken into consideration when interpreting the array data.

HCA Analysis

   Hydrophobic cluster analysis (HCA) identified a putative sugar-nucleotide-binding domain, the so-called DxD motif (Breton et al., 1998, 2001; Wiggins and Munro, 1998; Costa et al., 2002; Stolz and Munro,


2002), or a degeneration of this motif, [DE]-X-[DE] (compare with Tarbouriech et al., 2001), surrounded by stretches of hydrophobic amino acids in 19 of the 27 sequences as exemplified in Figure 3. It should be stressed, however, that the occurrence of a DxD motif alone is not diagnostic of a GT function (Gastinel, 2001; Coutinho et al., 2003). Running the 27 sequences through the Multiple EM for Motif Elicitation (MEME) version 3.0 server identified the putative sugar-nucleotide-binding DxD motif in both subgroups of group B (Fig. 4). Genes in group C share common overall features with the group B genes, e.g. varying sequence identity (69%, 35%, and 27%) and a DxD motif flanked by hydrophobic stretches situated approximately in the middle of the proteins (Table II; Fig. 3).
   When the accessions were run through the Pfam server, a tentative CAZy GT family relationship (GT1, GT2, or GT8) could be assigned for 10 of them, although the prediction power (E-value) in most cases was relatively poor (Table II). Accessions Q9SAD6, Q9LKU7, Q9T0G0, O23479, Q9C990, and Q9C991 were predicted to contain other domains, also with varying prediction power. Of these, a putative DxD motif as defined above could not be identified for the accessions Q9LKU7, Q9T0G0, and O23479, perhaps suggesting that the proposed GT function for at least these accessions should be considered carefully. The Pfam server is based on seed alignments, including also consensus alignment sequences of the various CAZy families. The relatively low number (10) of tentative CAZy GT family relationship assignments may reflect the difference between the sequence-based prediction strategies of Pfam/CAZy and the prediction strategies used by the mGenTHREADER and 3D-PSSM servers.

DISCUSSION

   In this study, we have identified 27 putative Arabidopsis GTs, which are not classified in the CAZy




          Figure 3. HCA analysis showing the DxD motif within a pocket of hydrophobic amino acids. The protein sequences are represented on a duplicated α-helical net, and the clusters of contiguous hydrophobic residues (V, I, L, F, M, W, and Y) are boxed. The actual assessments of the individual HCA plots were done manually in order to identify similarities between the sequences. The one-letter code is used for amino acids except for Gly, Pro, Ser, and Thr, which are represented by symbols. Vertical lines delineate hydrophobic pockets in which the DxD motif (boxed in black) can be found. Three well-known GTs were used in the analysis. A, α(1-4)galactosyltransferase LgtC (Neisseria meningitidis, TrEMBL accession no. Q8KHJ3); Persson et al. (2001). B, β(1-4)galactosyltransferase (Homo sapiens, TrEMBL accession no. Q9UBX8); Amando et al. (1999). C, Quasimodo, putatively involved in pectin biosynthesis (Arabidopsis, TrEMBL accession no. Q94BZ8); Bouton et al. (2002) and representatives from the 27 putative GTs, containing an identifiable DxD motif within a hydrophobic pocket: D, Q9ZSJ2; E, Q9M146; F, Q9C9Q5; G, A9LTZ5; H, Q9FF50; I, O045498.
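
A crude sequence-only stand-in for the manual HCA inspection can be sketched as a scan for the degenerate [DE]-X-[DE] motif that keeps only sites with hydrophobic residues on both sides. This is not HCA itself, which works on a two-dimensional helical net; the function name `find_dxd_candidates` and the `flank` and `min_hydrophobic` thresholds are assumptions chosen purely for illustration.

```python
import re
from typing import List, Tuple

# Residues treated as hydrophobic in the HCA plots (V, I, L, F, M, W, Y).
HYDROPHOBIC = set("VILFMWY")

# Lookahead so that overlapping candidate sites are all reported.
DXD = re.compile(r"(?=([DE].[DE]))")


def find_dxd_candidates(seq: str, flank: int = 8,
                        min_hydrophobic: int = 3) -> List[Tuple[int, str]]:
    """Return (1-based position, motif) for [DE]-X-[DE] sites with hydrophobic flanks.

    A site is kept when at least `min_hydrophobic` hydrophobic residues occur
    within `flank` residues on each side. Both thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    seq = seq.upper()
    hits = []
    for match in DXD.finditer(seq):
        start = match.start()
        left = seq[max(0, start - flank):start]
        right = seq[start + 3:start + 3 + flank]
        left_count = sum(res in HYDROPHOBIC for res in left)
        right_count = sum(res in HYDROPHOBIC for res in right)
        if left_count >= min_hydrophobic and right_count >= min_hydrophobic:
            hits.append((start + 1, match.group(1)))
    return hits


# Toy sequences (not real proteins): one motif inside a hydrophobic pocket,
# one motif in a Gly/Ser-rich context.
print(find_dxd_candidates("GSVILFVADADVLIVMGS"))
print(find_dxd_candidates("GSGSGSDADGSGSGS"))
```

As the text stresses, a DxD hit alone is not diagnostic of GT function, so a scan like this can only nominate sites for closer inspection.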



database. The 27 accessions have been selected as putative GTs, being typical of Golgi localized type II membrane proteins and characterized using various prediction servers, HCA analysis, and CW-specific array datasets. Recent proof of concept of the strategy used in this study has to some extent been obtained as functions in CW biosynthesis for two GT members of the phylogenetically distinct group B (Fig. 2) were established.

Figure 4. Conserved region of the B group accessions (compare with phylogenetic analysis Fig. 2) identified by the MEME server showing the putative DxD motif possibly involved in binding of the nucleotide sugar.

   Although the topology of noncellulosic backbone synthesizing enzymes remains an open question, it is tempting to suggest that the enzymes responsible for e.g. the synthesis of the α(1-4)-linked homogalacturonan backbone or the β(1-4)glucan backbone of XG might resemble the multimembrane spanning and processive cellulose synthases, which synthesize homopolymers with the same linkage type. Enzymatic assays of proteinase-treated intact and detergent-disrupted Golgi vesicles suggest that the catalytic site of the α(1-4)galacturonosyltransferase activity resides in the lumen of the Golgi (Sterling et al., 2001). Recently, an Arabidopsis gene, classified within CAZy GT-family-8, was cloned, heterologously expressed, and shown to possess galacturonosyltransferase activity (J. Sterling and D. Mohnen, personal communication). The predicted topology of this enzyme is that of a typical type II membrane-spanning protein, perhaps suggesting that at least the pectin synthesizing enzymes are type II membrane-anchored proteins. Current estimates suggest that about 1% of the open reading frames of each genome is dedicated to the task of glycosidic bond synthesis (Coutinho et al., 2003). When using the 415 CAZy classified Arabidopsis GTs as an estimate of the total number of glycosidic bond-forming enzymes, in plants this number is 1.6%, partly due to the existence of the highly complex CW. Based on the number of Arabidopsis genes that are predicted to possess signal peptides, it has been estimated that well over 2,000 genes are likely to participate in biosynthesis, assembly, and modification of CWs during plant development (Carpita et al., 2001). If soluble enzymes that participate in generation of substrates and the integral membrane-associated biosynthetic CW proteins, such as the cellulose synthase, are included, it has been estimated that some 15% of the Arabidopsis genome is dedicated to CW biogenesis and modification (Carpita et al., 2001).
   CAZy GT-family-1 consists of primarily soluble enzymes with function in secondary metabolism having rather small molecules as acceptor substrates. If the 121 Arabidopsis sequences in GT-family-1 are subtracted from the total 415 sequences, 296 GTs are left for glycosylation of proteins and lipids, synthesis of various polysaccharides, and CW biosynthesis. In Arabidopsis-rich CAZy GT families, such as GT8, GT31, or GT47, alignments of Arabidopsis accessions reveal the existence of highly identical genes within the GT families, which are likely to have identical function but may be differentially expressed. For the synthesis of pectin alone, one of the major noncellulosic CW polysaccharides, which comprises homogalacturonan and rhamnogalacturonan I and II, at least 53 distinct enzymatic activities are required (Mohnen, 1999; Ridley et al., 2001).
   In this study, the use of the most conservative transmembrane span prediction server as the first filter clearly filters out an unknown number of GTs with a weak TMD profile and perhaps also GTs without a TMD, which might interact in complexes with other membrane-bound GTs. A significant number of the Arabidopsis sequences, e.g. in the CAZy database GT-family-47, do not have a predictable TMD and are therefore often referred to as soluble enzymes. Of the six noncellulosic plant CW GTs with known function, the XG-specific β(1-2)galactosyltransferase, the putative rhamnogalacturonan II-specific β(1-4)glucuronosyltransferase, and the XG-specific α(1-2)fucosyltransferase are not predicted to possess an N-terminal transmembrane helix when run through the TMHMM 2.0 server used in this study (Table III). However, when run through the various prediction servers available at the ARAMEMNON site, the three GTs were predicted to contain an N-terminal TMD-like structure by at least one of the programs available at this site. In conclusion, it is unresolved whether some CW GTs are soluble. Reporter gene and/or tag fusion experiments may shed some light on this.

CONCLUDING REMARKS

   The CAZy database serves as the most complex and rich source of carbohydrate active enzymes. Classification of GTs in the CAZy database is based primarily on PSI-BLAST searches, using GTs with known function and in some cases proteins for which the 3D structures have been resolved, as the seed (Henrissat et al., 2001). With respect to GTs, it is widely accepted that secondary structure is more conserved than primary sequence. The classification scheme used in the CAZy database may not facilitate the identification of GTs that are only similar at a level higher than the primary sequence level (e.g. the fold level). A drawback of the present alternative approach may be the use of the SUPERFAMILY prediction server, which (as e.g. also Pfam) uses HMMs generated from alignments of proteins, the vast majority being of non-plant origin. We expect that a higher number of candidate GTs will be retrieved when it is possible to screen the entire


Arabidopsis proteome for proteins using servers op-                                generate a Unix list file that served as template for the generation of a FASTA
                                                                                   file from the Arabidopsis proteome. The BLAST 2.6.6 program for powerpc-
erating at the fold level.                                                         MacOSX was downloaded from ftp://ftp.ncbi.nih.gov and a BLASTable
   GTs situated in the Golgi apparatus involved in                                 database built from the FASTA file as described by the provider.
synthesis of the complex Asn-linked glycans of plant                                   The five independent FASTA files derived from the superfamily search
glycoproteins may be found among the accessions                                    were blasted against the local CAZy database using the BBEdit program
uncovered in this study. We expect that a significant                               (standard conditions with filtering off). Proteins in the dataset, which were
                                                                                   found in the local CAZy database, were discarded.
proportion of plant proteoglycan GTs are homologous
to similar enzymes from other eukaryotes due to the
structural similarities that exist in these glycans across                         3D-PSSM Server
kingdoms. If this assumption is valid, many of the                                     Fast Web-based methods for protein fold recognition using 1D and 3D
plant proteoglycan GTs are already in CAZy.                                        sequence profiles coupled with secondary structure information, i.e. SCOP-
   The sequential and parallel use of several prediction                           based profile HMMs, included the following: 3D-PSSM Web Server version
servers, albeit with relatively low stringency parameter                           2.6.0 (http://www.sbg.bio.ic.ac.uk; Kelley et al., 2000) and mGenTHREADER
                                                                                   available at the PSIPRED home page (http://bioinf.cs.ucl.ac.uk/psiform.
settings, inclusion of negative and positive controls of                           html). All predictions were performed using standard settings. Proteins larger
the filtering, followed by a post-filtering evaluation                               than 800 amino acids were submitted twice either with truncations in the N or
warrant that a substantial number of GTs indeed are                                the C terminus. The outputs were then collected and screened for known GT
found within the 27 accessions. It is, however, also clear                         PDB IDs. If more than one PDB ID was present in the output from the same
that e.g. the use of the conservative TMHMM server has                             file, only the one with the lowest E-value was listed.

as a consequence that relevant GTs have also been
eliminated and, hence, that the 27 putative GTs are but                            mGenTHREADER Server
a subset of the GTs that remain to be recognized as such.
                                                                                      mGenTHREADER (Jones, 1999; http://bioinf.cs.ucl.ac.uk/psipred/) is
                                                                                   a fold recognition server based on fold library profiles that uses the PSI-
                                                                                   BLAST profile and predicted secondary structure (PSIPRED). PSIPRED is
MATERIALS AND METHODS                                                              a secondary structure prediction method, incorporating two feed-forward
                                                                                   neural networks, which takes the output obtained from PSI-BLAST as input.
The Arabidopsis Proteome
                                                                                   mGenTHREADER, accessible from the PSIPRED Protein Structure Prediction
   The Arabidopsis proteome in a nonredundant form was downloaded from             Servers home page, was used with the following parameters: prediction
EMBL (http://www.ebi.ac.uk/proteome/index.html; 08072003), converted to            method, Fold Recognition (mGenTHREADER); output, default. The outputs
FASTA format, and split into 26,095 individual proteins using the Wisconsin-       were then collected and screened for known PDB IDs of known GTs (compare
package version 10.3 (http://www.biobase.dk/).                                     with Table I). If more than one PDB ID was present in the output from the
                                                                                   same file, only the one with the lowest E-value was listed.
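
The hit-selection rule described for the 3D-PSSM and mGenTHREADER outputs (screen each output for known GT PDB IDs and list only the one with the lowest E-value) can be sketched as follows. The function name, the input shape, and the PDB-style identifiers in the example are hypothetical placeholders; they are not the identifiers used in the study.

```python
from typing import Dict, Iterable, Optional, Tuple


def best_known_gt_hit(
    hits: Iterable[Tuple[str, float]],
    known_gt_folds: Dict[str, str],
) -> Optional[Tuple[str, str, float]]:
    """Pick the known-GT hit with the lowest E-value from one server output.

    hits           -- (pdb_id, e_value) pairs parsed from a single output file
    known_gt_folds -- upper-case PDB ID -> fold label ("GT-A" or "GT-B")

    Returns (pdb_id, fold, e_value), or None when no known GT structure is hit.
    """
    known = [(e_value, pdb_id.upper())
             for pdb_id, e_value in hits
             if pdb_id.upper() in known_gt_folds]
    if not known:
        return None
    e_value, pdb_id = min(known)  # the lowest E-value sorts first
    return pdb_id, known_gt_folds[pdb_id], e_value


# Placeholder identifiers, for illustration only.
folds = {"GTA1": "GT-A", "GTB1": "GT-B"}
output = [("xxxx", 1e-5), ("GTB1", 0.09), ("gta1", 0.5)]
print(best_known_gt_hit(output, folds))
```

Hits to structures outside the known-GT list are ignored even when their E-value is better, which is what restricts the fold assignment to GT-A or GT-B.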

TMHMM Version 2.0 Server

Predictions of transmembrane helices were carried out using the TMHMM server version 2.0 (Krogh et al., 2001; http://www.cbs.dtu.dk/services/TMHMM). All predictions were performed using standard settings. The proteome was submitted in subfiles (FASTA file format), each containing the maximum of 4,000 proteins allowed per submission. Output format was one text line per protein. The outputs from the entire proteome were collected in a single file for further screening. Proteins containing one or two TMDs starting within the first 150 N-terminal amino acid residues were extracted using the BBEdit program version 6.1.2 for MacOSX. A Unix list file containing the resulting accession numbers was generated, and these accession numbers were then used to extract the relevant protein sequences from the original proteome, which were stored in a FASTA file. This FASTA file was then used in the next filtering process.

Superfamily Version 1.63 Server

The SUPERFAMILY facility (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) implements a searchable library (covering all proteins of known structure) consisting of 1,232 SCOP superfamilies, each of which is represented by a group of HMMs, i.e. SCOP-based single sequence HMMs (Gough et al., 2001). The post-TMHMM FASTA file was divided into files containing a maximum of 20 proteins. These files were submitted using the following parameters: scoring options, Global/Global model/sequence scoring (for exact domains), BLAST pre-filter P < 1.0 × 10^-10. The output files were collected in one large file and sorted by superfamily domain. Proteins that were classified as belonging to one of the superfamilies listed below were independently collected using a Unix list file with relevant accession numbers, and a FASTA file was generated for each of the five superfamilies used in this study: NDP-sugartransferases, Gal-binding domain-like, UDP-glycosyltransferase/glycogen phosphorylase, carbohydrate-binding domain, and Rossmann fold.

Local CAZy Database

All the Arabidopsis protein accession numbers were collected from the CAZy database (September 10, 2003). These accessions were then used to

PDB IDs

The PDB IDs that were used for screening the output of both the 3D-PSSM and mGenTHREADER were collected from Wimmerova et al. (2003), who searched the Mycobacterium tuberculosis genome for GTs using, among other tools, fold recognition and the CAZy database (when the individual CAZy GT family contained more than one PDB ID for the particular protein, only one of the PDB IDs was used). References for the PDB IDs can be retrieved at http://www.RCSB.org/. All proteins were classified into one of the two secondary fold structures, GT-A or GT-B.

BLAST

As part of the validation process, the proteins were blasted using BLAST algorithms, which were accessible from the server at NCBI (National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov). The search included standard protein blast (blastp) and translated blast (tblastn). All searches were performed using standard settings and the BLOSUM 80 matrix. In the case of blastp, any hits, regardless of the E-value/identity, to animal, bacterial, or plant sequences were reported. Presence of ESTs was checked by blastn searches of the Arabidopsis EST database (http://www.ncbi.nlm.nih.gov).

SignalP Server

The candidate genes were scanned for the presence of signal peptides using the SignalP version 2.0.b2 World Wide Web server (Nielsen et al., 1999; http://www.cbs.dtu.dk/services/SignalP), which predicts the presence and location of signal peptide cleavage sites in amino acid sequences using HMM-based predictions. Predictions were done using the following parameters: organism group, Eukaryotes; method, HMMs; graphics, none; output format, standard. Proteins were truncated after the first 70 amino acids from the N terminus and submitted in a FASTA file format.
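
The truncation step used before the SignalP submission (keeping only the first 70 N-terminal residues of each protein and writing the result in FASTA format) could be reproduced along the following lines. This is a minimal sketch in plain Python; the function names, the 60-column line width, and the example sequences are assumptions introduced for illustration and are not taken from the original workflow.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def truncate_records(records, n_residues=70):
    """Keep only the first n_residues of every sequence (N-terminal end)."""
    return [(header, seq[:n_residues]) for header, seq in records]


def to_fasta(records, width=60):
    """Serialize (header, sequence) pairs as FASTA text, wrapped at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up sequences for illustration only
example_fasta = ">seq1\n" + "M" * 50 + "\n" + "A" * 50 + "\n>seq2\nMKT\n"
truncated = truncate_records(read_fasta(example_fasta))
fasta_out = to_fasta(truncated)
```

Sequences shorter than 70 residues pass through unchanged, so the same file can hold proteins of any length.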



MEME Version 3.0

Sequences for which the secondary structure resembled that of known GTs were submitted as a FASTA file to the MEME v 3.0 server (http://meme.sdsc.edu/meme/website/meme.html) in order to search for conserved domains. Standard settings were used for the search.

Alignments and Phylogenetic Analysis

All sequence alignments and calculations of sequence identities were performed by use of ClustalX version 1.81, available from Université Louis Pasteur, Strasbourg (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX; Thompson et al., 1997). Alignments were edited and printed using the program SeqVu (SeqVu version 1.0.1; http://www.cellbiol.com/soft.htm). Trees with bootstrap values from 1,000 resampling replicates were obtained using the Njplot program, which is part of the ClustalX program package. Printed trees were modified using the TreeViewPPC version 1.6.6 software (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).

HCA Analysis

HCA plots were obtained from the drawhca server on the Internet (http://smi.snv.jussieu.fr/hca/hca-form.html). The actual assessments of the individual HCA plots were done manually as described by Breton et al. (1998).

Pfam

Proteins were analyzed for the presence of known domains using the Pfam HMM (Bateman et al., 2004) available at the St. Louis Pfam server (http://pfam.wustl.edu). The searches were performed using individual global/local search options and a cutoff E-value > 0.1. Only the best hits were reported.

ARAMEMNON

ARAMEMNON (Schwacke et al., 2003; Arabidopsis Membrane Protein Database at http://aramemnon.botanik.uni-koeln.de/) consolidates prediction of transmembrane helices based on several TMD prediction servers. ARAMEMNON uses the following servers: Alom_v2 (http://psort.nibb.ac.jp/form2.html); HmmTop_v2 (http://www.enzim.hu/hmmtop/html/submit.html); MemSat_v1.8 (http://bioinf.cs.ucl.ac.uk/psiform.html); PHDhtm (http://cubic.bioc.columbia.edu/predictprotein/submit_exp.html#top); PHDhtm (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html); PredTmr_v1 (http://biophysics.biol.uoa.gr/PRED-TMR/input.html); SosuiG_v1.1 (http://sosui.proteome.bio.tuat.ac.jp/cgi-bin/sosui.cgi?/sosui_submit.html); THUMBUP_v1 (http://phyyz4.med.buffalo.edu/Softwares-Services_files/thumbup.htm); Tmap (http://www.mbb.ki.se/tmap/); TMHMM_v2 (http://www.cbs.dtu.dk/services/TMHMM/); TmPred (http://www.ch.embnet.org/software/TMPRED_form.html); and TopPred_v2 (http://bioweb.pasteur.fr/seqanal/interfaces/toppred.html).

Array Data

Isoxaben array data are available at http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl.

Distribution of Materials

Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third party owners of all or parts of the material. Obtaining any permission will be the responsibility of the requestor. Access to the novel accessions reported in this manuscript can be requested by e-mail (j.egelund@dias.kvl.dk).

Sequence data from this article have been deposited with the EMBL/GenBank data libraries under accession numbers Q8KHJ3, Q9UBX8, Q94BZ8, Q9ZSJ2, Q9M146, Q9C9Q5, A9LTZ5, Q9FF50, O045498, Q9C9Q6, Q9C9Q5, Q9FXA7, Q9M146, Q9ZSJ2, Q9ZSJ0, Q9LZ77, Q9M147, Q9C920, Q9XEE9, Q9C920, Q9LTZ5, Q9FM26, and Q9M146, Q9SAD6, Q9LKU7, Q9T0G0, O23479, Q9C990, Q9C991, Q9LKU7, Q9T0G0, and O23479.

ACKNOWLEDGMENTS

Dr. Ahmed Faik, Michigan State University, is acknowledged for instigating this line of research in our lab; Dr. Christelle Breton, INRA France, for helpful discussions and initial HCA analysis; and Dr. Kristian Axelsen, Institute of Plant Biology, The Royal Veterinary and Agricultural University, Denmark, and Swiss Institute of Bioinformatics, Geneva, for helpful discussions and propositions throughout the process. Dr. Julian Gough and Ph.D. student Martin Madera are greatly appreciated for their skillful help with submission to the SUPERFAMILY server. Dr. William G.T. Willats is acknowledged for providing corrected array data.

Received March 17, 2004; returned for revision April 15, 2004; accepted April 20, 2004.

LITERATURE CITED

Amando M, Almeida R, Schwientek T, Clausen H (1999) Identification and characterization of large galactosyltransferase gene families: galactosyltransferases for all functions. Biochim Biophys Acta 1473: 35–53
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28: 235–242
Bolwell GP, Northcote DH (1983) Arabinan synthase and xylan synthase activities of Phaseolus vulgaris. Subcellular localization and possible mechanism of action. Biochem J 210: 497–507
Bourne Y, Henrissat B (2001) Glycoside hydrolases and glycosyltransferases: families and functional modules. Curr Opin Struct Biol 11: 593–600
Bouton S, Leboeuf E, Mouille G, Leydecker M-T, Talbotec J, Granier G, Lahaye M, Höfte H, Truong N-H (2002) QUASIMODO1 encodes a putative membrane-bound glycosyltransferase required for normal pectin synthesis and cell adhesion in Arabidopsis. Plant Cell 14: 1–14
Breton C, Bettler E, Joziasse DH, Geremia RA, Imberty A (1998) Sequence-function relationships of prokaryotic and eukaryotic galactosyltransferases. J Biochem (Tokyo) 123: 1000–1009
Breton C, Mucha J, Jeanneau C (2001) Structural and functional features of glycosyltransferases. Biochimie 83: 713–718
Carpita N, Tierney M, Campbell M (2001) Molecular biology of the plant cell wall: searching for the genes that define structure, architecture and dynamics. Plant Mol Biol 47: 1–5
Carpita NC (1996) Structure and biogenesis of the cell walls of grasses. Annu Rev Plant Physiol Plant Mol Biol 47: 445–476
Cosgrove DJ (1997) Relaxation in a high-stress environment: the molecular bases of extensible cell walls and cell enlargement. Plant Cell 9: 1031–1041
Costa AA, Gomez FJ, Pereira M, Felipe MS, Jesuino RS, Deepe GS, de Almeida Soares CM (2002) Characterization of a gene which encodes a mannosyltransferase homolog of Paracoccidioides brasiliensis. Microbes Infect 4: 1027–1034
Coutinho PM, Deleury E, Davies GJ, Henrissat B (2003) An evolving hierarchical family classification for glycosyltransferases. J Mol Biol 328: 307–317
Crick DC, Mahapatra S, Brennan PJ (2001) Biosynthesis of the arabinogalactan-peptidoglycan complex of Mycobacterium tuberculosis. Glycobiology 11: 107R–118R
Edwards ME, Dickson CA, Chengappa S, Christopher C, Michael J, Gidley MJ, Grant Reid SJ (1999) Molecular characterisation of a membrane-bound galactosyltransferase of plant cell wall matrix polysaccharide biosynthesis. Plant J 19: 691–697
Encina A, Sevillano JM, Acebes JL, Alvarez J (2002) Cell wall modifications of bean (Phaseolus vulgaris) cell suspensions during habituation and dehabituation to dichlobenil. Physiol Plant 114: 182–191
Faik A, Price NC, Raikhel NV, Keegstra K (2002) An Arabidopsis gene encoding an α-xylosyltransferase involved in xyloglucan biosynthesis. Proc Natl Acad Sci USA 99: 7797–7802



Fry SC (1995) Polysaccharide-modifying enzymes in the plant cell wall. Annu Rev Plant Physiol Plant Mol Biol 46: 497–520
Gastinel LN (2001) Galactosyltransferases: a structural overview of their function and reaction mechanisms. Trends Glycosci Glycotechnol 13: 131–145
Geshi N, Jørgensen B, Ulvskov P (2004) Subcellular localization and topology of β(1,4)galactosyltransferase in potato. Planta 218: 862–868
Gough J, Chothia C (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res 30: 268–272
Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313: 903–919
Harper AD, Bar-Peled M (2002) Biosynthesis of UDP-xylose. Cloning and characterization of a novel Arabidopsis gene family, UXS, encoding soluble and putative membrane-bound UDP-glucuronic acid decarboxylase isoforms. Plant Physiol 130: 2188–2198
Henrissat B, Coutinho PM, Davies J (2001) A census of carbohydrate-active enzymes in the genome of Arabidopsis thaliana. Plant Mol Biol 47: 55–72
Hu Y, Walker S (2002) Remarkable structural similarities between diverse glycosyltransferases. Chem Biol 9: 1287–1296
Iwai H, Masaoka N, Ishii T, Satoh S (2002) A pectin glucuronosyltransferase gene is essential for intercellular attachment in the plant meristem. Proc Natl Acad Sci USA 99: 16319–16324
Jones DT (1999) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287: 797–815
Kelley LA, MacCallum RM, Sternberg MJE (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 299: 499–520
Kiino DR, Licudine R, Wilt K, Yang DH, Rothman-Denes LB (1993) A cytoplasmic protein, NfrC, is required for bacteriophage N4 adsorption. J Bacteriol 175: 7074–7080
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567–580
MacDougal AJ, Rigby NM, Ring SC (1997) Phase separation of plant cell wall polysaccharides and its implications for cell wall assembly. Plant Physiol 114: 353–362
Madson M, Dunand C, Li X, Verma R, Vanzin GF, Caplan J, Shoue DA, Carpita NC, Reiter W-D (2003) The MUR3 gene of Arabidopsis encodes a xyloglucan galactosyltransferase that is evolutionarily related to animal exostosins. Plant Cell 15: 1662–1670
McGuffin LJ, Jones DT (2003) Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 19: 874–881
Mohnen D (1999) Biosynthesis of pectins and galactomannans. In BM Pinto, ed, Comprehensive Natural Products Chemistry, Volume 3: Carbohydrates and Their Derivatives Including Tannins, Cellulose, and Related Lignins. Elsevier, Oxford, pp 497–527
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540
Nielsen H, Brunak S, von Heijne G (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 12: 3–9
Perrin RM, DeRocher AE, Bar-Peled M, Zeng W, Norambuena L, Orellana A, Raikhel NV, Keegstra K (1999) Xyloglucan fucosyltransferase, an enzyme involved in plant cell wall biosynthesis. Science 284: 1976–1979
Persson K, Ly HD, Dieckelmann M, Wakarchuk WW, Withers SG, Strynadka NCJ (2001) Crystal structure of the retaining galactosyltransferase LgtC from Neisseria meningitidis in complex with donor and acceptor sugar analogs. Nat Struct Biol 8: 166–175
Ridley BL, O’Neill MA, Mohnen D (2001) Pectins: structure, biosynthesis, and oligogalacturonide-related signaling. Phytochemistry 57: 929–967
Schedletzky E, Shmuel M, Delmer DP, Lamport TA (1990) Adaptation and growth of tomato cells on the herbicide 2,6-dichlorobenzonitrile leads to production of unique cell walls virtually lacking a cellulose-xyloglucan network. Plant Physiol 94: 980–987
Scheible W-R, Eshed R, Richmond T, Delmer D, Somerville C (2001) Modifications of cellulose synthase confer resistance to isoxaben and thiazolidinone herbicides in Arabidopsis Ixr1 mutants. Proc Natl Acad Sci USA 98: 10079–10084
Schwacke R, Schneider A, van der Graaff E, Fischer K, Catoni E, Desimone M, Frommer WB, Flügge UI, Kunze R (2003) ARAMEMNON, a novel database for Arabidopsis integral membrane proteins. Plant Physiol 131: 16–26
Sherrier DJ, VandenBosch KA (1994) Secretion of cell wall polysaccharides in Vicia root hairs. Plant J 5: 185–195
Sterling JD, Quigley HF, Orellana A, Mohnen D (2001) The catalytic site of the pectin biosynthetic enzyme α-(1,4)-galacturonosyltransferase is located in the lumen of the Golgi. Plant Physiol 127: 360–371
Stolz J, Munro S (2002) The components of the Saccharomyces cerevisiae mannosyltransferase complex M-Pol I have distinct functions in mannan synthesis. J Biol Chem 277: 44801–44808
Tarbouriech N, Charnock SJ, Davies GJ (2001) Three-dimensional structures of the Mn and Mg dTDP complexes of the family GT-2 glycosyltransferase SpsA: a comparison of related NDP-sugar glycosyltransferases. J Mol Biol 324: 655–661
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25: 4876–4882
Wimmerova M, Engelsen SB, Bettler E, Breton C, Imberty A (2003) Combining fold recognition and exploratory data analysis for searching for glycosyltransferases in the genome of Mycobacterium tuberculosis. Biochimie 85: 691–700
Wiggins CA, Munro S (1998) Activity of the yeast MNN1 α-1,3-mannosyltransferase requires a motif conserved in many other families of glycosyltransferases. Proc Natl Acad Sci USA 95: 7945–7950
Zhang GF, Staehelin LA (1992) Functional compartmentation of the Golgi apparatus of plant cells. Immunocytochemical analysis of high-pressure frozen- and freeze-substituted sycamore maple suspension culture cells. Plant Physiol 99: 1070–1083
Zhou H, Zhou Y (2003) Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 12: 1547–1555




12 of 12                                                                                                               Plant Physiol. Vol. 136, 2004