Sector classification through non-Gaussian similarity

M. Vermorken, A. Szafarz and H. Pirotte
Standard sector classification frameworks present drawbacks that might hinder portfolio managers. This paper introduces a new non-parametric approach to equity classification. Returns are decomposed into their fundamental drivers through Independent Component Analysis (ICA). Stocks are then classified according to the relative importance of identified fundamental drivers for their returns. A method is developed permitting the quantification of these dependencies, using a similarity index. Hierarchical clustering allows for grouping the stocks into new classes. The resulting classes are compared with those from the 2-digit GICS system for U.S. blue chip companies. It is shown that specific relations between stocks are not captured by the GICS framework. The method is applied to two different samples and tested for robustness.
JEL Classifications: G11, G19
Keywords: equity sectors, industry classification, portfolio management.

CEB Working Paper N° 08/032 October 2008

Université Libre de Bruxelles – Solvay Business School – Centre Emile Bernheim ULB CP 145/01 50, avenue F.D. Roosevelt 1050 Brussels – BELGIUM e-mail: Tel. : +32 (0)2/650.48.64 Fax : +32 (0)2/650.41.88

Maximilian Vermorken, Ariane Szafarz and Hugues Pirotte
October 23, 2008

Maximilian Vermorken
Research Fellow, Centre Emile Bernheim - Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP 145/1, 1050 Bruxelles, Belgium. Tel: +32 (0)2 650.48.64

Ariane Szafarz
Professor of Finance and Director of the Centre Emile Bernheim - Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP 145/1, 1050 Bruxelles, Belgium. Tel: +32 (0)2 650.48.65

Hugues Pirotte
Professor of Finance, Centre Emile Bernheim - Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP 145/1, 1050 Bruxelles, Belgium. Tel: +32 (0)2 650.48.64





The primary objective of equity classification is to facilitate efficient diversification and, consequently, to reduce the exposure of a portfolio to specific risk by summarizing the overall equity landscape. Studies of common characteristics in equity returns have resulted in the introduction of various criteria for the classification of equity into broad classes, based on within-class homogeneity. Style, industry affiliation, and geographical location are among the most popular criteria in use. This paper proposes a new approach to equity classification using nonparametric estimation based on the non-Gaussianity of financial returns, without an a priori definition of the classification criteria.

Financial returns depend on a large variety of factors and present heavy tails, leptokurtosis and asymmetry (Kon, 1984; Mills, 1985; Peiro, 1999; Premaratne and Bera, 2000; Patton, 2004). Criteria for grouping stocks should endeavor to include statistical and financial economic properties, i.e. non-normality and the dependence upon fundamental drivers. Most methods in use fix the criteria used to cluster equity and lack the flexibility to adapt to changing economic circumstances. For instance, the evolution of business activities within a given industry is usually ignored. Additionally, stock returns show significant non-Gaussianity, implying that statistical clustering methods relying on the normality hypothesis, such as those using Pearson's correlation as a distance measure, will be less accurate.


Through signal decomposition, the non-Gaussian characteristics of equity returns are extracted and serve as the pillar for the presented equity classification method. The method therefore hinges upon two essential aspects: first, extracting non-Gaussian characteristics of stock returns through Independent Component Analysis; second, classifying the considered stocks based on their dependence upon the extracted characteristics. Independent Component Analysis (ICA), introduced by Jutten and Herault (1986, 1991), Comon (1994), and Hyvärinen and Oja (1997), has proven to be very successful in letting data express their hidden structure. It provides a representation of the joint co-movement of multivariate data as a linear combination of sources, estimated under the condition of maximal stochastic independence. The technique has been applied in a variety of fields including brain imaging (Vigario et al., 1998 and 2000) and telecommunication (Cristescu et al., 2000), and in finance as a noise reduction tool by Back and Weigend (1997) or for the construction of factor models by Chan and Cha (2000) and Vessereau (2000).

The method presented in this paper shares common ground with equity classification using statistical decomposition methods and clustering techniques (Elton and Gruber, 1970; Farrell, 1974; Brown and Goetzman, 1997; Kao and Schumaker, 1999). It therefore endeavors to identify relationships between stocks of different sectors. Industry affiliation is shown by Connor (1995) to add significant explanatory power to a fundamental factor model in the explanation of equity return data. Similar results were obtained by Roll (1992), who indicates the important effect of industry concentration on equity markets. According to Chan et al. (2007), industry-based grouping methods are more capable of reflecting co-movement in the operating performance of corporations than statistical clustering methods based on Pearson's correlation coefficient. Baca et al. (2000) exhibit the importance of sector factors in the explanation of national equity returns for seven major industrial nations including the United States.

Industrial factors have also been shown to possess interesting features in periods of international crises. Beckers, Connor and Curds (1996), Heston and Rouwenhorst (1995) and L'Her and Sy Tnani (2002) investigated the period 1975-2000 and conclude that country-specific risk is of more importance than industry-specific risk. The contrary is found by Baca, Garbe and Weiss (2000), Cavaglia et al. (2000) and Hamelink et al. (2001), who suggest that diversification across industries provides greater risk reduction than diversification using country factors (see footnote 1). These results are generally obtained from regression models assessing country or industry effects using dummy variables. Additionally, Beine et al. (2006) indicate more robustness against contagion effects and interdependencies in international markets when industry factors are used rather than geographic ones. As such, the actual classification based on industry affiliation remains challenging.
1 in more recent years, which has been expected by the advocates of the advantages of globalization.


A number of industry affiliation frameworks have been presented in the past. Standard Industrial Classification (SIC) codes use end product or production method as a discriminatory criterion. The Fama and French (1997) framework is based on the SIC while reorganizing the result into 48 industry groups. Clarke (1989) expresses some doubt about these classification methods, as end products, technology and the activity of companies tend to change over time. The Global Industry Classification Standard (GICS), developed by MSCI and Standard & Poor's, uses product and production characteristics as well as the investor's perception of the company as classification criteria. This classification method has gained significant attention from practitioners. Standard & Poor's, for instance, uses it in the construction of indices. The classes of the GICS framework do have the advantage of being based on company characteristics which are easily observable. Also, investors' perception is taken into account, which builds some adaptability into the system. Our nonparametric approach yields results suggesting that GICS inadequately distributes stocks among classes. The extracted fundamental drivers indicate dependencies of a different nature between the analyzed stocks. The GICS is used here as a reference in the analysis performed on samples of U.S. blue chip companies.

The remainder of this paper is structured as follows. In section 2, a full description of the methodology is given. In section 3, the selected data are presented. Section 4 discusses the empirical results, which are tested for robustness. A conclusive summary is given in the last section.

The methodology used in this paper has three parts. First, non-Gaussian and statistically independent components are extracted from stock returns. This task is performed through an ICA algorithm. Second, the retrieved independent components are sorted by their importance to each stock. The same $n$ components are present in each of the $n$ stocks, though with varying importance. Only some of these independent components are considered significant. However, they are not consistently the same ones for each stock. The essence of this classification system therefore lies in finding the common significant components of each stock. Third, after selecting a limited number of significant components (retained components) for each stock, it is possible to compute a distance measure between the stocks. This measure, based on the number of common components, quantifies the shared dependence of two stocks upon specific non-Gaussian characteristics. It can be seen as a correlation metric that avoids the normality hypothesis and uses the true distribution of stock returns. Clusters are then formed using the computed distance. Dunn's index is used to assess the quality of the obtained clusters.


First Step: Extracting non-Gaussian components

Consider a group of $n$ equity returns, $r_1, \ldots, r_n$, that is, $n$ series of length $m$ observed during periods $t = 1, \ldots, m$. These series are piled up in an $n \times m$ matrix, $R = (r_{it})$, where $r_{it}$ is the value of $r_i$ observed at time $t$. ICA decomposes $R$ into $n$ statistically independent series, $s_1, \ldots, s_n$, called independent components, using the following linear decomposition:

$$R = WS$$

$$\begin{pmatrix} r_{11} & \cdots & r_{1m} \\ \vdots & \ddots & \vdots \\ r_{n1} & \cdots & r_{nm} \end{pmatrix} = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nn} \end{pmatrix} \begin{pmatrix} s_{11} & \cdots & s_{1m} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nm} \end{pmatrix}$$

$S = (s_{it})$ is an $n \times m$ matrix with rows corresponding to the independent components, and $W = (w_{ij})$ is an $n \times n$ weighting matrix. Each equity return series is therefore expressed as a weighted sum of $n$ independent components. Matrix $R$ is the only observable element from which $W$ and $S$ have to be deduced. The decomposition is based on the maximization of non-Gaussianity within the multivariate matrix $R$. The independent components are therefore latent variables. No particular ordering of independent components is provided by the algorithm, and the importance of each component for a particular stock is variable.

The problem of extracting $n$ independent components knowing only the returns of the stocks (the $R$ matrix) has been solved by Blind Source Separation type algorithms. These algorithms use information theory in order to extract $n$ components from a multivariate return distribution $R$. Series are sought inside the multivariate distribution that are maximally non-Gaussian. Non-Gaussianity can be measured using several indicators including kurtosis, mutual information, and negentropy. The latter is a measure based on the information-theoretical concept of entropy. Entropy is a quantification of the randomness and uncertainty associated with statistical series. Negentropy quantifies the entropy difference existing between a series and its Gaussian equivalent (see appendix A for more details).

Once the $n$ independent components are extracted from the $n$ stocks and weighted, they need to be sorted. For each stock a sorted list of $n$ independent components is constructed based upon the same unique list of $n$ independent components. The independent components have different importance for the returns of each stock. Only part of the events included in the different components are relevant for a particular stock. The independent components will therefore carry stock-specific weights. These weights are, however, insufficient to give a definitive order to the components. As a stock is reconstructed by cumulatively summing up its weighted independent components, the total reconstructed information will vary with each new member. It is the information added to the reconstruction of the stock that indicates the importance of each independent component. We use the Mean Square Error (MSE) measure in a method based upon Cheung and Xu (2001). The Mean Square Error between the partially reconstructed stock and the original series computes a value for the amount of information that is added with each new component in the cumulative sum. For each $r_i$ we find a series of $n$ ordered components.

$\bar{v}_{(i)}$ is the series of ordered and weighted components for stock $i$, which is initially empty. Let $\bar{u}_{(i)}$ be the series of weighted and unsorted independent components, which initially contains all $n$ weighted independent components. Let the $j$-th weighted independent component be defined as

$$u_{(i)j} = w_{ij} s_j, \quad 1 \le j \le n.$$

The reconstruction error between the stock and the $p$ summed-up independent components is defined as

$$MSE(r_i - \hat{x}^p_l), \qquad \hat{x}^p_l = \sum_{l} \bar{v}_{(i)l} + \bar{u}_{(i)k}, \qquad \bar{u}_{(i)k} \in \bar{u}_{(i)} \text{ and } \bar{u}_{(i)k} \notin \bar{v}_{(i)},$$

where the sum runs over the components already placed in $\bar{v}_{(i)}$. The optimal order of the independent components for a stock $i$ is found by minimizing

$$\bar{v}_{(i)p} = \underset{\bar{u}_{(i)k}}{\operatorname{argmin}} \; MSE(r_i - \hat{x}^p_l) \quad \forall \, \bar{u}_{(i)k} \in \bar{u}_{(i)}.$$

When the minimum is found, $\bar{u}_{(i)k}$ is removed from the list $\bar{u}_{(i)}$.
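The ordering procedure described above can be read as a greedy search. The following is a minimal sketch under that reading, not the authors' implementation: the function names `mse` and `order_components`, the toy weights and the toy sources are assumptions introduced for illustration. At each step, the remaining weighted component that gives the lowest MSE when added to the partial reconstruction is moved from the unsorted list to the sorted list.

```python
def mse(a, b):
    """Mean square error between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def order_components(r, weights, sources):
    """Greedy ordering of the weighted components of one stock.

    r       : return series of the stock (length m)
    weights : row of the weighting matrix W for this stock (length n)
    sources : the n independent components, each of length m
    Returns the component indices in the order they were selected and
    the MSE remaining after each addition.
    """
    weighted = [[w * s for s in src] for w, src in zip(weights, sources)]
    remaining = list(range(len(weighted)))   # unsorted list (u)
    partial = [0.0] * len(r)                 # running reconstruction
    order, errors = [], []                   # sorted list (v) and its MSE path
    while remaining:
        def error_if_added(k):
            candidate = [p + u for p, u in zip(partial, weighted[k])]
            return mse(r, candidate)
        best = min(remaining, key=error_if_added)
        partial = [p + u for p, u in zip(partial, weighted[best])]
        order.append(best)
        errors.append(mse(r, partial))
        remaining.remove(best)
    return order, errors


# Toy data (hypothetical): three sources observed over four periods.
sources = [[1, -1, 1, -1], [1, 1, -1, -1], [1, 0, 0, -1]]
weights = [0.1, 2.0, 0.5]
r = [sum(w * src[t] for w, src in zip(weights, sources)) for t in range(4)]
order, errors = order_components(r, weights, sources)
```

In this toy case the component with the largest weight is selected first, and the error path in `errors` decreases as each further component is added.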

As the data used includes noise, and as a satisfactory reconstruction of the stocks can be achieved using only a fraction of the $n$ independent components, $q$ independent components will be retained for the analysis. Back and Weigend (1997) used four components (out of twenty-eight, in their case) to achieve a satisfactory reconstruction of the data. In this analysis a variable number of components is considered for each stock. The MSE criterion is computed for each stock and each component added. Figure 1 clearly shows that the MSE varies greatly across the sample and declines rapidly. The threshold is therefore set at an MSE equaling 0.1.

Figure 1: Evolution of the average Mean Square Error between the returns of the considered stocks and the summed-up independent components for each GICS sector. Calculations are made on the ICA decomposition of a sample of 98 stocks. The average is taken across the number of stocks of each sector.

For each stock the number of retained components depends upon the rapidity with which the MSE declines towards the set threshold of 0.1. The first $q$ independent components of each series $\bar{V}_n = (\bar{v}_{(i)})$ are retained for each stock. These components represent the fundamental drivers on which stocks depend the most. Determining the intersections between these groups of components for each stock enables us to quantify the common dependence between stocks on these fundamental drivers.
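As an illustration of the retention rule, the sketch below picks the number of retained components for one stock from its MSE path. It rests on assumptions not stated in the text: the function name `retained_count` is hypothetical, the rule is read as "keep components until the MSE first falls to or below the threshold", and the MSE values are taken to be on a scale where 0.1 is meaningful (for example, standardized series).

```python
def retained_count(errors, threshold=0.1):
    """Smallest q such that the MSE after adding q components is <= threshold.

    errors : MSE after adding the 1st, 2nd, ... ordered component
    If the threshold is never reached, all components are retained.
    """
    for q, err in enumerate(errors, start=1):
        if err <= threshold:
            return q
    return len(errors)


# Hypothetical MSE path for one stock: the threshold is met at q = 2.
q = retained_count([0.185, 0.01, 0.0])
```

Because the MSE path differs from stock to stock, this rule naturally yields a variable number of retained components per stock.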

Second Step: Similarity Index

Using the tripartite similarity index by Tulloss (1997), we quantify the similarity between two stocks $r_i, r_j$ as the value $t_{r_i, r_j}$, piled up into the $n \times n$ matrix $T = (t_{r_i, r_j})$ (see appendix B). The Tulloss index quantifies the intersection between two lists of sorted independent components. As these components represent non-Gaussian effects and their relative importance to the reconstruction of the considered stocks, the index can be seen as a non-parametric correlation coefficient representing common non-Gaussian effects. In practice it takes into account both the size of the intersection and the number of components present in one list but not in the other. As the length of the lists of components is variable across stocks, this index shows significant advantages over, for instance, rank correlations.

At this stage, each stock in the sample has been dissected into its fundamental drivers, the independent components. The stocks have then been reconstructed in order to identify which components matter most to each stock. The intersection between the lists of ordered components is then found in order to quantify the common dependence of the stocks upon the same components.
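To make the idea of comparing lists of retained components concrete, the sketch below uses the Jaccard index as a simple stand-in. This is explicitly not the Tulloss tripartite index, whose formula is given in appendix B and is not reproduced here. The Jaccard index is chosen only because it also depends on both the size of the intersection and the number of components found in one list but not the other. The function names and the example lists are hypothetical.

```python
def jaccard_similarity(retained_a, retained_b):
    """Share of components common to both lists among all components in either.

    Stand-in for the similarity index: 1.0 means identical sets of retained
    components, 0.0 means no component in common.
    """
    a, b = set(retained_a), set(retained_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(retained):
    """Pairwise similarities for a dict {stock name: retained component ids}."""
    names = list(retained)
    return {(x, y): jaccard_similarity(retained[x], retained[y])
            for x in names for y in names}


# Hypothetical retained components for three stocks (lists of unequal length).
retained = {"A": [1, 2, 0], "B": [2, 0, 5], "C": [7]}
T = similarity_matrix(retained)
```

Like the index used in the paper, this stand-in handles lists of different lengths without any padding or ranking assumption.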



Third Step: Stock Clustering

The third step involves grouping the securities. Using hierarchical clustering, the test sample is formally rearranged into an optimal number of subsets. Hierarchical clustering is based upon the notion of successively combining all stocks into groups. This method enables us to follow the dependencies between stocks and clusters using a tree-shaped graph, or dendrogram. A linkage function is used to quantify the distance between the formed groups, based on the distances computed by the similarity index. The average distance between all pairs of stocks in clusters $k = 1, \ldots, z$ is used as distance function. If $c_c$ and $c_d$ are two clusters, with $c \neq d \in [1, \ldots, z]$, containing stocks $c_{ci}$ and $c_{dj}$, with $i = 1, \ldots, n_c$ and $j = 1, \ldots, n_d$, the linkage function is defined as follows.

$$d(c_c, c_d) = \frac{1}{n_c n_d} \sum_{i=1}^{n_c} \sum_{j=1}^{n_d} t(c_{ci}, c_{dj})$$

Based on the similarity index, the stocks are successively grouped into increasingly larger groups. The clustering yields a tree-like structure in which the distance between the clusters is indicated by the height of the branches. The procedure is repeated for a variable number of final partitions, assessing each time the quality of the groups using Dunn's index (Dunn, 1973). Dunn's index is defined as

$$Dunn = \frac{\min_{c_c, c_d} \left( dist(c_c, c_d) \right)}{\max_{k} \, diam(c_k)}$$

where $dist$ is the minimal distance between two elements of two different clusters and $diam$ is the maximal diameter of a cluster. Dunn's index is used to assess the cohesiveness of the found groups and should therefore be maximized. In our case, however, the index is minimized, as the distance and diameter are computed using the similarity index and not a traditional measure such as the Euclidean distance.
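A minimal sketch of Dunn's index for a given partition is shown below, using the definitions of $dist$ and $diam$ given in the text. The function name `dunn_index` and the toy matrix are assumptions for illustration. The sketch follows the conventional reading with a dissimilarity matrix, where larger values indicate better-separated and more compact clusters. When the entries are similarities, as in the paper, the direction of optimization is reversed, as noted in the text.

```python
def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by the largest diameter.

    clusters : list of lists of item indices (at least two clusters)
    dist     : symmetric matrix of pairwise values with a zero diagonal
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Closest pair of elements belonging to two different clusters.
    min_between = min(dist[i][j]
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters))
                      for i in clusters[a]
                      for j in clusters[b])
    # Farthest pair of elements inside any single cluster.
    max_diameter = max(dist[i][j] for c in clusters for i in c for j in c)
    if max_diameter == 0:
        raise ValueError("all clusters are singletons; the index is undefined")
    return min_between / max_diameter


# Hypothetical symmetric dissimilarity matrix for four stocks.
dist = [[0.0, 0.2, 0.8, 0.6],
        [0.2, 0.0, 0.4, 1.0],
        [0.8, 0.4, 0.0, 0.3],
        [0.6, 1.0, 0.3, 0.0]]
good = dunn_index([[0, 1], [2, 3]], dist)
poor = dunn_index([[0, 2], [1, 3]], dist)
```

Evaluating the index for each candidate number of clusters gives the kind of curve from which the final partition is chosen.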

A data set is constructed based on the components of the Standard and Poor's S&P100 index over a 5-year period, or 1250 days, starting on February 14, 2002. The data comprise daily closing prices, which are transformed into continuous daily returns by

$$r_{it} = \log(P_{it}) - \log(P_{i,t-1})$$

where $P_{it}$ is the closing price of security $i$ on day $t$.
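The return transformation can be written in a few lines. The sketch below is illustrative only; the function name `log_returns` and the price series are made up for the example.

```python
import math


def log_returns(prices):
    """Continuous daily returns: log(P_t) - log(P_{t-1}) for consecutive days."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    return [math.log(curr) - math.log(prev)
            for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices for one security over four days.
closing = [100.0, 101.0, 99.5, 102.0]
returns = log_returns(closing)
```

A series of $m + 1$ prices yields $m$ returns, and the returns sum to the log of the ratio of the last price to the first.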


The S&P100 index comprises only U.S. blue chip corporations representing the ten 2-digit GICS industrial sectors. Two companies (Lucent Technologies and HCA) had to be excluded as their stocks did not remain publicly traded over the entire considered period. Table 1 gives a full list of the selected stocks, together with their GICS class, the empirical moments and the results for the Jarque-Bera normality statistic. Table 1 shows that the returns are not normally distributed. Excess kurtosis is very much present.

Table 1: Summary statistics for the components of the S&P 100 index of U.S. blue chip companies.

The period under study is representative of a global economic expansion driven principally by strong economic growth from which, for the first time in history, almost every country profited. In the U.S., a significant increase in consumption, now reaching nearly 70% of GDP, drove the economic expansion. These facts are of singular importance, as the sectoral dependencies should be marked and observable in such a period of economic expansion. The high demand for commodities and energy, putting strain on supply, led to fast-rising profits in the Energy and Materials sectors. The Information Technology sector recovered from the bursting of the bubble in previous years and profited from increasing consumption. The incredible inflation of credit, combined with the creation of additional financial products, led to increased earnings for financial institutions. One would expect homogeneity across specific sectors, with sectors such as Materials, Energy, Financials and even Consumer Discretionary outperforming others. Our results show, however, a different picture. Sectoral homogeneity is in most cases far-fetched, while individual companies show high dependencies with companies outside their GICS sector. The concept of the winning sector, which used to be popular in the nineties, seems to be replaced by high-performing groups of companies that cannot be traced back to a specific GICS sector.

To confine the analysis to sectoral dependencies, only blue chip corporations were selected. All corporations are U.S. based, which prevents country effects from influencing the results. Figure 2 presents the Pearson's correlation coefficients. In order to make the correlation coefficients legible, the matrix was converted into a color map. Correlations are generally low. Some stocks such as CBS, OfficeMax, Home Depot, Tyco or AES present low correlations with nearly all companies across all sectors, represented by dark lines running across the figure. The absence of significant grouping or strong relations between corporations having very similar activities indicates less homogeneity than could be expected from blue chip companies. Indeed, such homogeneity would have been visible along the main diagonal of Figure 2, with light-colored rectangles for the different sectors. Such rectangles are, however, rare, and where they exist they are usually accompanied by equally strong correlations with other sectors.

Figure 2: Pearson's Correlation Coefficient: The correlation matrix of the considered sample of 98 stocks is converted into a color map to improve legibility. Values vary between 0 and 1; there are no negative correlation coefficients.

Secondly, even though most sectors as defined by the GICS framework do not exhibit particular cohesion, some exceptions must be noted. Among them we find the Financials first of all and, to a lesser extent, the Materials, Utilities and Energy sectors. The Financial sector, however, exhibits nearly equally large correlations with, for instance, Industrial or Material companies (situated between numbers 58 - 71 and 84 - 89). This indicates a potentially stronger relationship between these sectors than one could expect. Also, the absence of strong dependencies inside the Health Care sector is interesting. Similarly, the Industrials sector exhibits high correlations outside of its sector, but mostly for isolated cases. General Electric (67), for instance, is highly correlated with a number of financial institutions. Similarly, IBM or, for example, Du Pont show higher correlations with financial institutions than with companies of their respective sectors. Finally, all stocks are positively correlated. This is of course to be expected, as all companies are U.S. blue chip companies and among the largest in the nation. At the same time, it implies that a relatively strong market tendency is present, something that most certainly will have an influence on the found independent components.

Empirical Results and Robustness Test
The approach in the current paper differs greatly from the system that led to the construction of the GICS classes. Still, some correspondence should be present between the classes defined in the GICS framework and those found here. A number of companies do have strong links based on their end products and production processes and therefore should be part of the same cluster. Let us first look at the similarity index in Figure 3. The index forms the basis of the clustering methods and already shows some differences with the correlations in Figure 2. The sectors that presented cohesion in the correlation matrix seem to do so as well for the Tulloss index. However, high similarity clearly goes beyond the borders of the individual sectors. When looking at the area between stock numbers 65 and 98, for instance, the dependencies seem to have increased for all companies regardless of the sector. Additionally, between 35 and 65, a large number of companies seem rather independent from nearly all sectors. In this same area the correlation matrix indicated sector-like dependencies.

Figure 3: Similarity Index: Index computing the similarity between the series of reconstructed and ordered independent components, for each stock. The index takes values between 0 and 1. The matrix should be compared to the correlation matrix. Higher order moments are taken into account in ICA. Non-Gaussianity leads to different dependencies between stocks.

The result of clustering according to this similarity matrix will therefore most probably have two characteristics. First, the cohesion inside GICS sectors will probably be less important and be replaced by larger dependencies outside the sectors. Second, a number of companies will be clustered alone or in very small groups, as they show relative independence on a general level.

Hierarchical Clustering
In order to identify which dependencies exist between the analyzed companies and how each of them relates to the others, a hierarchical clustering system is used. Each stock is considered individually at first. Stocks are then grouped into clusters of two stocks based upon the similarity index. For subsequent groupings the selected distance function is used. It computes the average distance between each of the clusters and groups them together. The succession of these steps leads to regrouping all the stocks into one final cluster, the sample itself. Figure 4 shows the evolution of Dunn's index as a function of the number of groups. The discrete nature of the similarity index, which is not a continuous distance metric like, for instance, the Euclidean distance, explains the stepwise evolution of the graph.

Figure 4: Dunn's Index: Evolution of Dunn's index as a function of the number of clusters considered in the hierarchical clustering analysis. The graph is based upon the similarity index, which implies that Dunn's index should be minimized. Contrary to traditional distance metrics, this measure of similarity between the series of independent components is discrete. This explains the step-like nature of the graph.

According to Dunn's index, twenty groups imply optimal clustering. Figure 5 shows the corresponding dendrogram constructed on the basis of this clustering analysis. The height of each U-shape indicates the distance between the clustered stocks or groups. Table 2 gives the list of the companies grouped in each of the 20 clusters.

Figure 5: Dendrogram: Based upon the average distance between stocks and subsequent clusters, measured by the similarity index, the stocks are grouped in increasingly large groups. The height of the U-shapes indicates the dissimilarity between the two entities grouped together. The numbers on the horizontal axis correspond to the clusters.

Table 2: Dendrogram clusters: Each number corresponds to a cluster in the dendrogram. Companies are indicated together with ticker and GICS sector. Twenty clusters are constructed based on the minimization of Dunn's Index.

In this final classification a larger number of companies is clustered independently, while others show strong relations with companies outside their GICS sectors. The largest cluster is the second one in Table 2. It groups various segments of GICS sectors that possess large dependencies outside their GICS cluster, confirming that GICS sectors capture only part of the non-Gaussian dependencies. These dependencies depend upon the dynamic nature of actual returns rather than the static criterion of production process or product. A number of GICS sectors are still partially present in cluster Nr 2: Financial companies (investment banks together with American Express), the main producers of personal computers, and the two major oil producers combined with the two main oil field services companies. These companies have seen their profits increase from rising oil consumption and oil prices and, in the case of the service suppliers, from the increased search for oil by Exxon Mobil and Chevron in particular. The Materials companies in cluster Nr 2 are the two main paper producers and Du Pont, a specialist in synthetic plastics.

Similar observations can be made for the other major clusters. Group Nr 3 clusters, like group Nr 2, various companies with specific links. Ford and GM are part of this cluster; AIG and Hartford, the two major insurance companies, as well as all the producers of semiconductors, are also part of this subset. The major beverage producers, Coca-Cola, Pepsi and Anheuser-Busch, form a separate cluster with P&G. A striking observation, as the financial crisis unfolds, is that the dependencies found using our method are relevant even in a rapidly declining market. The most recent observation is dated February 14th, 2007. Health care companies specializing in drugs and research into pharmacological products form a separate cluster, Nr 13. The other health care companies, more specialized in medical equipment, are clustered separately. Interestingly, Johnson and Johnson is not part of the main Health Care cluster.

The formed clusters do show strong links based upon the end products. However, the important distinctions made between sectors in frameworks like GICS are much less relevant. A possible explanation comes from the rigidity of the GICS, which, by adapting only slowly to changes in the activities of companies, does not capture changes that bring some sectors closer together. The financial sector, for instance, has strong ties to those companies for which it underwrites debt. These ties are temporary, but highly significant.

Alongside these large clusters, a large number of small clusters can be found. Most of them consist of individual companies. One might think that these individual companies are particular cases that split off from large clusters. The contrary appears when looking at the dendrogram. Most of the individual stocks link up to other individual stocks. Raytheon, a large manufacturer of mainly high-tech goods, often for national defense purposes, links up with General Dynamics, another company producing similar end products. Some icons of American consumption such as McDonald's or Heinz stand alone. Heinz links up to the cluster containing Coca-Cola, Pepsi, Procter and Gamble and Anheuser-Busch. These companies produce cyclical consumer goods and enjoy a solid position in investors' opinion.

Comparing these last results to the GICS framework, two main observations emerge. First, the end product and production clustering in the GICS framework seems to be a reasonable criterion up to a certain extent. Non-parametric similarities lead to the same kind of dependence among a large portion of the stocks considered. The end product criterion influences investors and will have the effect of grouping stocks together. It is, however, insufficient to reveal the real dependencies. Second, the notion of industry group is less present in our clusters. Instead, companies are grouped into much larger or much smaller groups. Notoriety, as in the case of Heinz and Coca-Cola, also influences the formation of clusters. Our classification helps tackle the question of winning industry versus winning firm. In the 1990s, investors were looking for sectors outperforming others. Nowadays, this notion has less importance, as it seems individual companies can outperform an entire sector. These firms are affected by different non-Gaussian effects than most of the companies inside their GICS class.

The conclusions drawn from the hierarchical clustering are instructive. First, homogeneity of the GICS classes is not confirmed by our results. In very limited cases, such as a fraction of the Health Care sector or the Utilities sector, only firms of a particular sector form a separate cluster. The most observed pattern is either large groups containing similar companies from different sectors, or very small groups with individual companies or pairs. The links existing between the small groups and the larger ones are generally fairly weak. Large clusters seem to have more dependence with other large clusters.

Robustness Tests
In order to investigate the robustness of the classification presented in this paper, the S&P sample was split into two equal sub-samples. Each sub-sample contains 47 firms and covers every GICS sector, with the same number of companies per GICS class in both. Independent components are re-estimated and the similarity index is recalculated. The two sub-samples are then clustered into twenty clusters, again an optimal number of clusters, using hierarchical clustering. The results can be found in Figure 6 and Table 3. Dunn's index is minimized to find the optimal clustering.

As the clustering is based on non-Gaussian characteristics present in the multivariate distributions of the analyzed securities, and the estimation of the independent components is based upon maximizing the non-Gaussianity of the projection onto a weight vector, the strongly non-Gaussian effects of some stocks will overshadow those of other stocks. Less dominant non-Gaussianities might play a bigger role when the sample composition is changed. However, as all independent components are present, with more or less importance, in all stocks, strong dependencies can be expected to remain when the sample is changed. The presence of a number of such dependencies should therefore be considered as an indication of the robustness of the clustering method.

Figure 6: Dendrogram: Sub-sample 1 contains 47 stocks selected from the principal test sample. The sample represents one half of the complete sample. An equal number of stocks per sector are present in both sub-samples.

The results for sub-sample 1 are presented in Figure 6 and Table 3. First let us consider the left-hand corner of the first dendrogram, made up of clusters 20, 15, 4, 8, 1, 2 and 5. These clusters regroup a large number of the stocks found in the two large clusters of the full sample. As these large clusters 2 and 3 show a rather strong relation, it is logical to find the various stocks in the left branch of the dendrogram. Among those stocks we find Financials, Energy and Consumer Discretionary stocks such as the investment banks, Exxon Mobil and Chevron, and companies like Disney and GM. The cohesion observed between the Health Care companies in the full sample is less visible. The IT companies that were also part of clusters 2 and 3 of the full sample are again found closely clustered together in the left branch of the dendrogram. Some companies that were alone in the full sample remain alone in the first sub-sample. Sara Lee, Campbell Soup and Allstate are among those. Some relations between companies might be overshadowed in the full sample and resurface when the sample is split into two parts. At the same time the average number of firms in each cluster is now halved. The clusters found in the full sample are split along the lines of the weakest links.

Table 3: Clusters corresponding to dendrogram of sub-sample 1

The conclusions found for the first sub-sample also hold for the second sub-sample in Figure 7 and Table 4. As in the first sub-sample, the firms exhibiting strong dependencies are clustered in a variety of groups in the left branch of the dendrogram. This branch also shows small distances between the groups. In particular, Microsoft, EMC Corp, Xerox and Oracle present high similarity and are classified closely. Similarly strong relationships are found for the Industrials and Financials, grouped together in the left branch.

Figure 7: Dendrogram: Sub-sample 2 contains 47 stocks selected from the principal test sample. The sample represents one half of the total sample. Each sub-sample has the same general structure with respect to the GICS framework.

Table 4: Clusters corresponding to dendrogram of sub-sample 2
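The cluster-validity criterion used in the robustness test can be illustrated with a small sketch. It assumes the textbook form of the Dunn (1973) index, i.e. the smallest distance between two clusters divided by the largest within-cluster diameter, under which larger values indicate more compact and better separated clusters. The exact variant and optimization convention used in the paper are not spelled out, so the function name `dunn_index`, the toy distance matrix and the labels below are purely illustrative assumptions.

```python
from itertools import combinations


def dunn_index(dist, labels):
    """Textbook Dunn index for a symmetric distance matrix.

    dist   : list of lists, dist[i][j] is the distance between items i and j
    labels : labels[i] is the cluster assigned to item i
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Largest distance between two members of the same cluster.
    diameter = max(
        (dist[i][j]
         for members in clusters.values()
         for i, j in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between members of two different clusters.
    separation = min(
        dist[i][j]
        for a, b in combinations(clusters.values(), 2)
        for i in a
        for j in b
    )
    if diameter == 0.0:
        return float("inf")  # every cluster is a singleton
    return separation / diameter


# Toy example (invented numbers): items 0-1 are close, items 2-3 are close.
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
natural = dunn_index(toy, [0, 0, 1, 1])   # groups the close pairs together
shuffled = dunn_index(toy, [0, 1, 0, 1])  # splits the close pairs apart
print(natural, shuffled)
```

In a split-sample check of the kind described above, such a score would be computed for each candidate number of clusters in each sub-sample, and the resulting partitions compared with those of the full sample.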

Much research has been dedicated to minimizing the risk exposure of portfolios through adequate stock clustering. Equity has been grouped according to style, economic or industry affiliation, country or residence. Purely statistical methods have also been used, a field in which Elton and Gruber (1970), Farrell (1974), Brown and Goetzmann (1997) and Kao and Schumaker (1999) are the principal contributors. The method proposed and tested in this paper is based on non-Gaussian similarity between stock returns.

The GICS framework classifies stocks according to end products, production processes and investor perception of the firms. It has received increasing attention, in particular among practitioners. The GICS framework is taken as a benchmark in our study. Given the non-Gaussian nature of equity returns, as indicated among others by Kon (1984), Mills (1995), Peiró (1999), Premaratne and Bera (2000) and Patton (2004), fundamentals driving equity returns can be retrieved by Independent Component Analysis. All particular dependencies, whether situated in the tail or, e.g., in the asymmetry of the distribution, are taken into account. Retrieving similarity between sets of components that have the highest influence in the reconstruction of the considered stock is the basis of the approach described here.

Clustering stocks using the method presented in this paper has several advantages. First, the method depends only on stock returns and their characteristics while being non-parametric. This implies flexibility and adaptability. The dependencies are not rigid in time and take into account the situation and the dependencies between stocks as they are perceived by the market. When a crisis strikes a number of companies or an entire sector, the possible contagion effects, expressed in non-Gaussian dependencies between stocks, will be captured and integrated into the mechanisms for grouping equity. When a portfolio is set up to avoid the effects of crises, as suggested by Brière and Szafarz (2008) for a bond portfolio, rebalancing it could be avoided. However, when major dependency shifts occur, the stability of equity clusters as a whole might be considerably reduced. New data change the dependency structure and thus the clusters. With our method, practitioners have a tool to accommodate such situations. When constructing a portfolio, or when rebalancing it in the face of changing economic or market conditions, portfolio managers should consider the advantages the present method offers.

Comparing this new method to the traditional GICS framework, some differences are noted. The clustering analysis has shown that GICS captures non-Gaussian dependency structures in S&P 100 stocks. However, it falls short due to its inability to quickly adapt to changing market realities. As a crisis might spread from one firm to another, or from one sector to another, our method offers the possibility to group stocks based on their non-Gaussian dependence. The GICS framework can thus be seen as a good foundation on which to build an initial analysis of the market and the potential portfolios that can be constructed. But before deciding in which stocks to invest, it would be wise to cluster the considered universe using the method presented above.

Some elements could still be improved in order to gain accuracy in the sorting algorithm. Independent components tend to capture very specific events common to a number of companies. In order to gain additional stability of the system, higher frequencies could be removed from the returns using filters before performing ICA. Medium-term and trend effects would in that case determine the nature of the independent components and ultimately the nature of the groups. These filtered components would be more stable over time, as it is likely that stock returns are affected by only a small set of stock-specific effects. Consequently, the clusters found would remain more similar over time. A further refinement would be to identify the components explaining the cross-section of the returns, i.e., components representing the market, and to separate them from those representing more idiosyncratic events. Clustering methods using only stock-specific components could result from such improvements.

The method presented here could be applied to other areas of finance. The analysis of dependencies in real estate markets or between property types is a possible extension. As the economic climate evolves, it is very possible that dependencies between real estate markets shift. Capturing such shifts is made possible through our approach. Similarly, country or sector influences with respect to the diversification of portfolios could be analyzed using ICA-based clustering algorithms. Past research (Roll (1992), Beckers, Connor and Curds (1996), Heston and Rouwenhorst (1995) and L'Her, Sy and Tnani (2002)) is inconclusive as to whether diversifying across countries or across sectors provides greater risk reduction.

The present paper suggests that sectors change with economic conditions. Economic changes lead to variations in the coherence of industry sectors, as defined for instance by the GICS framework. Our empirical results indeed show that a small number of sectors exhibit real coherence, such as the Financial sector; however, even these entities rarely seem independent. The conclusion is that either sectors should be considered as variable entities, or the number of sectors should be limited to those showing real cohesion, complemented by a vast group of unspecified companies.

Globally, a valid way of grouping equities into classes should limit its ex-ante assumptions to a minimum while taking into consideration the historical stock returns in order to capture as much information as possible about the firms' evolution. In an efficient market, stock prices are the best source of information from which the real dependencies between companies can be inferred. It is therefore important to capture this information across all moments of the distributions of equity returns through data analysis, without predetermining a general distribution of the equity returns. Any attempt to group sectors should be flexible. Strict sectors, in which a company's assignment remains unaffected by its performance and its dependencies upon other stocks, prove to be sub-optimal in the quest for diversifying sector-specific risks. The suggested clustering system incorporates this idea and could lead to better classifications of equity.

[1] Baca, S., B. Garbe, and R. Weiss (2000), "The Rise of the Sector Effects in Major Equity Markets", Financial Analysts Journal, Vol. 56, pp. 34-40.
[2] Back, A. and A.S. Weigend (1997), "A First Application of Independent Component Analysis in Finance", International Journal of Neural Systems, Vol. 8, No. 5, pp. 473-484.
[3] Beine, M., Preumont P.-Y., Szafarz A. (2006), "Sector diversification during crises: A European perspective", Working Paper DULBEA, Research series, N°06-07.RS, May.
[4] Beckers, S., Connor, G., and Curds, R. (1996), "National versus global influences on equity returns", Financial Analysts Journal, Vol. 52, pp. 31-39.
[5] Brière, M., Szafarz, A. (2008), "Crisis-Robust Bond Portfolios", The Journal of Fixed Income, Fall, Forthcoming.
[6] Brown, S. and W. Goetzmann (1997), "Mutual Fund Styles", Journal of Financial Economics, Vol. 43, No. 3, pp. 373-399.
[7] Cavaglia, S., C. Brightman, and M. Aked (2000), "The increasing importance of industry factors", Financial Analysts Journal, Vol. 56, No. 5, pp. 41-54.
[8] Cha, S.M and Chan, L. (2000), "Applying Independent Component Analysis to Factor Model in Finance", Intelligent Data Engineering and Automated Learning - IDEAL 2000, Springer, pp. 538-544.
[9] Chan, L., J. Lakonishok and B. Swaminathan (2007), "Industry Classification and Return Comovement", Financial Analysts Journal, Vol. 63, No. 6, pp. 56-70.
[10] Cheung, Y. and Xu, L. (2001), "Independent Component Ordering in ICA Time Series Analysis", Neurocomputing, Vol. 41, pp. 145-152.
[11] Christopherson, J. A. (1995), "Equity Style Classification", The Journal of Portfolio Management, Vol. 21, pp. 32-43.
[12] Clarke, R.N. (1989), "SICs as Delineators of Economic Markets", Journal of Business, Vol. 62, No. 1, pp. 17-31.
[13] Connor, G. (1995), "The Three Types of Factor Models: A Comparison of Their Explanatory Power", Financial Analysts Journal, Vol. 51, No. 3, pp. 42-46.
[14] Cristescu R., J. Joutsensalo, J. Karhunen and E. Oja (2000), "A Complex Minimization approach for Estimating Fading Channels in CDMA Communications", in Proc. Int. Workshop on Independent Component Analysis and Blind Source Separation, Helsinki, Finland.
[15] Deakin, E.B. (1976), "Distributions of Financial Accounting Ratios: Some Empirical Evidence", The Accounting Review, January 1976, pp. 90-96.
[16] Dunn, J.C. (1973), "A Fuzzy Relative of the ISODATA Process and its Use in Detecting Well-Separated Clusters", Journal of Cybernetics, Vol. 3, pp. 32-57.
[17] Elton, E. and M. Gruber (1970), "Homogeneous Groups and the Testing of Economic Hypothesis", Journal of Financial and Quantitative Analysis, Vol. 4, No. 5, pp. 581-602.
[18] Fama, E. and K.R. French (1997), "Industry Cost of Equity", Journal of Financial Economics, Vol. 43, No. 2, pp. 153-193.
[19] Farrell, J. (1974), "Analyzing Covariation of Returns to Determine Homogeneous Stock Groupings", Journal of Business, Vol. 47, No. 2, pp. 186-207.
[20] Hamelink F., H. Harasty, and P. Hillion (2001), "Country, Sector or Style: What Matters Most When Constructing Global Equity Portfolios? An Empirical Investigation from 1990-2001", Working Paper, FAME.
[21] Heston, S. L., and Rouwenhorst, K. G. (1995), "Industry and country effects in international stock returns", The Journal of Portfolio Management, Vol. 21, pp. 53-58.
[22] Hyvärinen A. and E. Oja (2000), "Independent Component Analysis: Algorithms and Applications", Neural Networks, Vol. 13, No. 4-5, pp. 411-430.
[23] Hyvärinen, A. and E. Oja (1997), "A Fixed Point Algorithm for Independent Component Analysis", Neural Computation, Vol. 9, pp. 1483-1492.
[24] Hyvärinen, A. (1999), "Survey on Independent Component Analysis", Neural Computing Surveys, Vol. 2, pp. 94-128.
[25] Hyvärinen, A., J. Karhunen, and E. Oja (2001), Independent Component Analysis, John Wiley and Sons, New York.
[26] Jutten, C. and J. Hérault (1991), "Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture", Signal Processing, No. 24, pp. 1-10.
[27] Kao D.-L. and R. D. Schumaker (1999), "Equity Style Timing", Financial Analysts Journal, Vol. 55, pp. 18-37.
[28] Koedijk, K., E. Kole and M. Verbeek (2006), "Selecting Copulas for Risk Management", CEPR Discussion Papers 5652, C.E.P.R. Discussion Papers.
[29] Kon, S.J. (1984), "Models of Stock Return - A Comparison", Journal of Finance, Vol. 39, pp. 147-165.
[30] L'Her, J., Sy, O., and Tnani, M. Y. (2002), "Country, industry and risk factor loadings in portfolio management", The Journal of Portfolio Management, Vol. 28, pp. 70-79.
[31] Michaud, R. O. (1998), "Is Value Multidimensional? Implications for Style Management and Global Stock Selection", Journal of Investing, Vol. 7, No. 1, pp. 61-65.
[32] Mills, T.C. (1995), "Modeling Skewness and Kurtosis in London Stock Exchange FTSE Index Return Distributions", Statistician, Vol. 44, pp. 323-332.
[33] Patton, A.J. (2004), "On the Out-of-Sample Importance of Skewness and Asymmetric Dependence for Asset Allocation", Journal of Financial Econometrics, Vol. 2, pp. 130-168.
[34] Peiró, A. (1999), "Skewness in Financial Returns", Journal of Banking and Finance, Vol. 6, pp. 847-862.
[35] Premaratne, G., and A. K. Bera (2000), "Modeling Asymmetry and Excess Kurtosis in Stock Return Data", Working Paper 00-0123, University of Illinois.
[36] Roll, R. (1992), "Industrial Structure and Comparative Behavior of International Stock Market Indices", Journal of Finance, Vol. 47, No. 1, pp. 3-41.
[37] Speidell L. S. and J. Graves (2003), "Are Growth and Value Dead?: A new Framework for Equity Investment Styles", The Handbook of Equity Style Management, Edited by Coggin T.D. and Fabozzi F.J. (2003), John Wiley, New Jersey, USA.
[38] Stein, B., S. Meyer zu Eisen and F. Wissbrock (2003), "On Cluster Validity and the Information Need of Users", 3rd IASTED Int. Conference on Artificial Intelligence and Applications (AIA 03), Benalmádena, Spain, September, Edited by M. H. Hamza, pp. 216-221.
[39] Tulloss, R. E. (1997), "Assessment of Similarity Indices for Undesirable Properties and a new Tripartite Similarity Index Based on Cost Functions", in Palm, M. E. and I. H. Chapela, eds., Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders (Parkway Publishers, Boone, North Carolina), pp. 122-143.
[40] Vessereau, T. (2000), "Factor Analysis and Independent Component Analysis in the Presence of High Idiosyncratic Risk", Série Scientifique N° 46, Cirano, University of Montreal, Canada.
[41] Vigário, R., V. Jousmäki, M. Hämäläinen, R. Hari and E. Oja (1998), "Independent Component Analysis for the Identification of Artifacts in Magnetoencephalographic Recordings", in Advances in Neural Information Processing Systems, MIT Press, Vol. 10, pp. 229-235.
[42] Vigário, R., J. Särelä, V. Jousmäki, M. Hämäläinen and E. Oja (2000), "Independent Component Approach to the Analysis of EEG and MEG Recordings", IEEE Trans. Biomedical Engineering, Vol. 47, No. 5, pp. 589-593.


Independent Component Analysis: Extraction of Independent Components

The extraction of independent components is based upon maximizing non-Gaussianity within a multivariate random variable, or in the present case the multivariate distribution of stock returns. Non-Gaussianity can be measured using several indicators including kurtosis, mutual information, and negentropy. The latter is a measure for uncertainty associated with random variables. More precisely, the negentropy $J(s_{i\cdot})$ of a continuous multivariate random variable $s_{i\cdot}$ is defined as

\[ J(s_{i\cdot}) = H(g_i) - H(s_{i\cdot}) \]

with the entropy equaling

\[ H(s_{i\cdot}) = -\int p_s(s_{i\cdot}) \log p_s(s_{i\cdot}) \, ds_{i\cdot}, \]

where $s_{i\cdot}$ is a multivariate random variable with density $p_s(\cdot)$ and $g_i$ is the Gaussian random variable with the same expectation and covariance matrix as $s_{i\cdot}$.

In practice the fixed point algorithm FASTICA by Hyvärinen, Karhunen and Oja (2001) used in this paper uses the gradient of the approximation of the negentropy

\[ J(s_{i\cdot}) \approx E\{G(w^T r_{i\cdot})\} - E\{G(g_{i\cdot})\}, \]

where $G$ is a non-quadratic function and $g_{i\cdot}$ is a standardized Gaussian multivariate variable. The algorithm converges by choosing a weight vector $w$ such that the projection $w^T r_{i\cdot} = s_{i\cdot}$ is maximally non-Gaussian.
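To make the role of the contrast function more concrete, the following sketch evaluates the difference $E\{G(y)\} - E\{G(g)\}$ for two synthetic one-dimensional samples. It is an illustration under stated assumptions, not the paper's implementation: the choice $G(u) = \log\cosh(u)$ is a common option in the FastICA literature but is not specified in the text, the Gaussian reference expectation is obtained by simple numerical integration, and all function names and data are hypothetical. In Hyvärinen and Oja (2000) the negentropy approximation is proportional to the square of this difference, so only its magnitude matters.

```python
import math
import random


def log_cosh(u):
    """Assumed contrast G(u) = log cosh(u), written to avoid overflow."""
    a = abs(u)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def standardize(xs):
    """Center a sample to zero mean and scale it to unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]


def gaussian_reference(limit=8.0, steps=4000):
    """E{G(g)} for a standard Gaussian g, by the trapezoidal rule."""
    h = 2.0 * limit / steps
    total = 0.0
    for k in range(steps + 1):
        x = -limit + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += weight * log_cosh(x) * density
    return total * h


def contrast_gap(ys, reference):
    """E{G(y)} - E{G(g)} for the standardized sample ys."""
    zs = standardize(ys)
    return sum(log_cosh(z) for z in zs) / len(zs) - reference


rng = random.Random(0)  # fixed seed so the illustration is reproducible
n = 20000
gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Laplace draws: a heavy-tailed stand-in for a non-Gaussian source.
laplace_sample = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]

ref = gaussian_reference()
gap_gauss = contrast_gap(gaussian_sample, ref)
gap_laplace = contrast_gap(laplace_sample, ref)
print(ref, gap_gauss, gap_laplace)
```

The Gaussian sample gives a gap close to zero, while the heavy-tailed sample gives a clearly non-zero gap. A FastICA-type algorithm searches for the weight vector whose projection of the returns makes this departure from the Gaussian reference as large as possible.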


Similarity Index: Tulloss (1997)
Given are $\bar{S}_i = (\bar{s}_{(q)})$ and $\bar{S}_j = (\bar{s}_{(q)})$, $j \neq i \in [1, ..., n]$, two lists of weighted and sorted independent components. Let

\[ a = |\{\bar{s}_{(i)} = \bar{s}_{(j)}\}|, \qquad b = |\{\bar{s}_{(i)} \in \bar{S}_i : \bar{s}_{(i)} \notin \bar{S}_j\}|, \]

and let

\[ c = |\{\bar{s}_{(i)} \in \bar{S}_j : \bar{s}_{(i)} \notin \bar{S}_i\}|. \]

Using the tripartite similarity index by Tulloss (1997) we quantify the similarity between two stocks $r_i, r_j$ as the value $t_{ij}$ given by

\[ t_{ij} = \sqrt{f_A(q) \times f_B(q) \times f_C(q)}, \]

where $f_A$, $f_B$ and $f_C$ are three cost functions which make up the index:

\[ f_A(q) = \frac{\log\left(1 + \frac{\min(b,c) + a}{\max(b,c) + a}\right)}{\log 2}, \]

\[ f_B(q) = \frac{1}{\sqrt{\dfrac{\log\left(2 + \frac{\min(b,c)}{a+1}\right)}{\log 2}}}, \]

\[ f_C(q) = \frac{\log\left(1 + \frac{a}{a+b}\right) \cdot \log\left(1 + \frac{a}{a+c}\right)}{(\log 2)^2}. \]

The three functions determine a cost for the number of components the two series share, the number of components they do not share and the difference in length of the two series that are compared. These features greatly increase the accuracy of the index. The index always has a value greater than or equal to zero and smaller than or equal to one, with zero when no similarity is discovered and one when the two series are equal. Once the cost function is applied, a distance matrix $T = (t_{r_i, r_j})$ is obtained. $t_{r_i, r_j}$ quantifies the distance in terms of dependence on similar independent components between stock $r_i$ and stock $r_j$.
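A minimal sketch of the tripartite similarity computation is given below. It assumes the form of the index published by Tulloss (1997), with the three cost functions combined under a square root, and it treats each stock's list of dominant independent components as a plain collection of labels. How the paper weights, sorts and truncates these lists is not specified here, so the function name `tripartite_similarity` and the example labels are hypothetical.

```python
import math


def tripartite_similarity(list_i, list_j):
    """Tripartite similarity (after Tulloss, 1997) between two collections
    of component labels. Returns a value between 0 and 1."""
    set_i, set_j = set(list_i), set(list_j)
    a = len(set_i & set_j)  # components present in both lists
    b = len(set_i - set_j)  # components only in the first list
    c = len(set_j - set_i)  # components only in the second list
    if a == 0:
        return 0.0          # nothing shared (also covers two empty lists)

    log2 = math.log(2.0)
    # Cost for the difference in size between the two lists.
    f_a = math.log(1.0 + (min(b, c) + a) / (max(b, c) + a)) / log2
    # Cost for the components that are not shared.
    f_b = 1.0 / math.sqrt(math.log(2.0 + min(b, c) / (a + 1.0)) / log2)
    # Reward for the share of common components in each list.
    f_c = (math.log(1.0 + a / (a + b)) * math.log(1.0 + a / (a + c))) / log2 ** 2
    return math.sqrt(f_a * f_b * f_c)


# Hypothetical component labels for three stocks.
stock_1 = ["ic3", "ic7", "ic12"]
stock_2 = ["ic7", "ic12", "ic20"]
stock_3 = ["ic1", "ic2"]

print(tripartite_similarity(stock_1, stock_1))  # identical lists
print(tripartite_similarity(stock_1, stock_2))  # two of three components shared
print(tripartite_similarity(stock_1, stock_3))  # no component shared
```

Applying the function to every pair of stocks fills the matrix $T$ that is then passed to the hierarchical clustering step.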


