POSTED ON: 1/18/2010
Web Intelligence
O & A in Bioinformatics

Graphs and Networks II

This is a lecture for week 7 of `Web Intelligence And Ontologies & Algorithms in Bioinformatics’. Some images in this lecture are from the Strogatz paper.

This Week

• Degree distributions
• Cluster Coefficient and Cluster Function
• Small-world and scale-free networks
• Modularity and Hierarchy

Degree Histograms

(Mainly we will be talking about undirected networks.)

Recall the degree of a node:

[Figure: a small graph on the four nodes A, B, C, D.]

A has degree 3, B has degree 2, C has degree 2, D has degree 1.

[Figure: the degree histogram of this tiny graph, frequency against degree 0 to 5.]

I.e. it has 0 nodes of degree 0, 1 of degree 1, 2 of degree 2, 1 of degree 3, and 0 of degree >3.

You might also think of the degree histogram as a table, e.g.:

Degree:     0  1  2  3  4  5
Frequency:  0  1  2  1  0  0

Degree distributions

Degree:        0  1     2    3     4  5
Frequency:     0  1     2    1     0  0
Distribution:  0  0.25  0.5  0.25  0  0

The degree distribution is a function P(k), which gives the probability of a randomly chosen node from the graph having degree k.

What is the degree distribution of the complete graph on 1000 nodes?

Imagine I have a graph with 1000 nodes, but no links. Now I start adding links randomly, one by one. After 10 random additions, what do you expect the degree distribution to be? What will the average node degree be after 1000 additions?

Example degree distributions

[Figure: P(k) against k for a random graph, a single peak with the probabilities falling away on both sides.]

This is the standard situation in a network where links are added completely at random. If there are n nodes, and m edges randomly added, then the peak of this is at 2m/n, the average degree. So, for a randomly picked node, the most likely degree is the average one. The probabilities then drop quickly either side.

[Figure: the directorships figure from Strogatz.]

Notice the stretched out tail. Unlike random graphs, there are quite a few very highly connected nodes. Consider what this means. A few people have influence over many companies.
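As an aside, the degree histogram and degree distribution defined earlier in this lecture can be computed in a few lines. The Python sketch below is only an illustration. The edge list (A-B, A-C, A-D, B-C) is an assumption: it was chosen because it reproduces the degrees quoted for the four-node example, and the graph on the slide may have been drawn differently. The function names are likewise made up for this sketch.

```python
from collections import Counter

# Assumed edge list: it gives A degree 3, B degree 2, C degree 2, D degree 1,
# matching the degrees quoted in the lecture. The real figure may differ.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]


def degree_histogram(nodes, edges, max_degree=5):
    """Return a list h where h[k] is the number of nodes with degree k.

    Nodes with degree above max_degree are not counted in this sketch.
    """
    degree = {v: 0 for v in nodes}
    for u, v in edges:  # undirected: each edge adds 1 to both endpoints
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    return [counts.get(k, 0) for k in range(max_degree + 1)]


def degree_distribution(histogram):
    """Divide each frequency by the number of nodes to get P(k)."""
    n = sum(histogram)
    return [count / n for count in histogram]


histogram = degree_histogram(nodes, edges)
distribution = degree_distribution(histogram)
print(histogram)     # [0, 1, 2, 1, 0, 0]
print(distribution)  # [0.0, 0.25, 0.5, 0.25, 0.0, 0.0]
```

The two printed rows match the Frequency and Distribution rows of the table for the tiny graph.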
These might just be very busy people, or controllers. What kind of person might have 20 co-directors, rather than 40?

The Tails

As you know, it is the tails of the degree distribution that seem interesting. Some notes:

In real world networks, these tails are much fatter and longer than in random networks of the same size.

It seems that, in this tail region, P(k) follows a power law. That simply means that the way the probability decreases with k seems to be a reasonably close fit to $k^{-\lambda}$ for some $\lambda$, i.e.:

$P(k) \sim k^{-\lambda}$

But if so, note that:

$P(k) \sim k^{-\lambda} \;\Rightarrow\; \log P(k) \sim -\lambda \log k$

So, if there really is a good fit between the tail and a power law, then when we plot log(P(k)) against log(k) we should get a straight line sloping downwards towards the right.

[Figure: Power law for exponents -1 to -3. P(k) against k, for k from 12 to 30, with one curve for each of the exponents -1, -1.5, -2, -2.5 and -3.]

A 10-node 10-edge Random Graph

[Figure: a random graph on the nodes A to J.]

Degree distribution: 0.4, 0.3, 0.2, 0.1 (for degrees 1 to 4)
Longest shortest path: D to F, length 6

A 10-node 10-edge Small-World Graph

[Figure: a graph on the nodes A to J, built up one edge at a time over several slides.]

We will build this by a Preferential Attachment process: the chance of a new edge being incident at a node increases with the degree of that node.
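A Preferential Attachment process of this kind can be sketched in Python as follows. This is one possible reading rather than the exact procedure used to draw the slides: it assumes that both endpoints of each new edge are chosen with probability proportional to (degree + 1), where the +1 is an assumption added so that nodes with no edges yet can still be picked. The function name and the seed parameter are hypothetical.

```python
import random


def preferential_attachment(nodes, num_edges, seed=0):
    """Add num_edges distinct undirected edges, favouring high-degree nodes.

    Assumption: each endpoint is drawn with weight (degree + 1), so that
    isolated nodes still have a non-zero chance of being chosen.
    """
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    if num_edges > max_edges:
        raise ValueError("too many edges requested for a simple graph")

    rng = random.Random(seed)
    degree = {v: 0 for v in nodes}
    edges = set()
    while len(edges) < num_edges:
        weights = [degree[v] + 1 for v in nodes]
        u, v = rng.choices(nodes, weights=weights, k=2)
        edge = frozenset((u, v))
        if u == v or edge in edges:
            continue  # reject self-loops and repeated edges, then redraw
        edges.add(edge)
        degree[u] += 1
        degree[v] += 1
    return edges, degree


nodes = list("ABCDEFGHIJ")
edges, degree = preferential_attachment(nodes, 10)
# Degrees sorted from largest to smallest; the values depend on the seed.
print(sorted(degree.values(), reverse=True))
```

Running this with several seeds and comparing the sorted degree lists against those of a graph whose edges are added uniformly at random is a quick way to see the longer tail emerge, although with only 10 nodes the effect is noisy.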
A 10-node 10-edge Small-World Graph

[Figure: the finished graph on the nodes A to J.]

Degree distribution: 0.6, 0.2, 0.0, 0.1, 0.0, 0.1 (for degrees 1 to 6)
Longest shortest path: A to E, length 4

Notice that:

This was only a tiny example, but illustrative …

The degree distribution of the graph generated by Preferential Attachment had a longer tail. Its diameter was smaller.

PA leads to graphs that have many high degree nodes. Such a node is therefore one hop away from many others. So, a short path is usually available via such a node.

Such graphs with small diameter (longest shortest path, or average shortest path) are called small world networks. Such networks also seem highly clustered.

Scale-free networks

We have learned that $P(k) \sim k^{-\lambda}$ for many real networks. I.e. real networks seem to have a long `tail’ in their degree distribution, with significant numbers of nodes having high degree.

In a random network, most nodes will have their degree close to the average. So there is a characteristic or typical degree. But this is not the case if the power law (above) holds. There is no `typical’ degree.
The range of degree values varies very greatly, so such a network is called scale-free.

Clustered (or modular) graphs

[Figure: a graph made up of several densely connected groups of nodes.]

This graph is clearly clustered: there are groups (clusters) of nodes that are highly interconnected amongst themselves, but have few connections to other clusters. Would such a graph tend to have a high or low diameter?

`Hierarchical’ graphs

[Figure: two graphs, a modular graph on the left and a larger graph on the right built from three copies of it.]

The graph on the left is called modular. The graph on the right is also clearly modular. E.g. there are three distinct modules (the things that are copies of the graph on the left). However, each of these modules seems to have a modular structure of its own. This is called hierarchical modularity.

More metrics

So far we can characterise graphs by:

• Number of nodes
• Density (number of edges divided by number of possible edges)
• Average path length, longest shortest path length (diameter)
• Degree distribution

But we need more (graphs which are the same in all these respects could still be different in terms of the modular and hierarchical aspects of their structure). To capture these aspects, there are:

• Cluster coefficient
• Cluster function

Defined next …

The Cluster Coefficient

[Figure: the graph on the nodes A to J.]

Consider node B. It has 5 neighbours (can you define `neighbour’?): D, G, J, C, I.

Every distinct pair of neighbours (there are 5 x 4 / 2 = 10 distinct pairs) forms a potential triangle with B. The triangle BJCB exists, because edge CJ exists. But none of the other 9 exist. The cluster coefficient of node B is 1/10. What is the CC of node C?

The Cluster Coefficient: a proper definition

Suppose node i has $n_i$ neighbours. Therefore there are $n_i (n_i - 1) / 2$ possible triangles (edges that link node i’s neighbours). Suppose $t_i$ of these edges are in the graph. The clustering coefficient $c_i$ of node i is defined as:

$c_i = \frac{2 t_i}{n_i (n_i - 1)}$

The mean of this for a graph is called the CC of the network, C. I.e.:

$C = \frac{1}{N} \sum_{i=1}^{N} c_i$

where $c_i$ is the cluster coeff. of node i, and N is the number of nodes in the graph.
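The definition above translates fairly directly into code. The Python sketch below works on a fragment of the example graph: it contains only the edges stated in the text (the five edges from B to its neighbours, plus the edge C-J), because the other edges of the figure cannot be recovered here. Treating the coefficient of a node with fewer than two neighbours as 0 is an assumption of this sketch, since the formula is undefined in that case. The function names are illustrative.

```python
from itertools import combinations

# Fragment of the lecture's graph: only the edges the text states explicitly.
# The remaining edges of the original figure are unknown and left out.
edges = [("B", "D"), ("B", "G"), ("B", "J"), ("B", "C"), ("B", "I"), ("C", "J")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours (undirected graph)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def cluster_coefficient(adjacency, node):
    """c_i = 2 * t_i / (n_i * (n_i - 1)) for the given node."""
    neighbours = adjacency[node]
    n_i = len(neighbours)
    if n_i < 2:
        return 0.0  # assumed convention: no pairs of neighbours, so use 0
    # t_i: number of pairs of neighbours that are themselves linked
    t_i = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2 * t_i / (n_i * (n_i - 1))


def network_cluster_coefficient(adjacency):
    """C = (1 / N) * sum of c_i over all N nodes."""
    total = sum(cluster_coefficient(adjacency, v) for v in adjacency)
    return total / len(adjacency)


adjacency = build_adjacency(edges)
print(cluster_coefficient(adjacency, "B"))  # 0.1
```

For node B this reproduces the value 1/10 worked out by hand. Values for other nodes computed from this fragment should not be read as answers for the full graph, because their remaining edges are missing.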
Some related things …

The Cluster Function (with respect to node degree) C(k) is defined as follows:

$C(k) = \frac{1}{|N_k|} \sum_{i \in N_k} c_i$

where $N_k$ is the set of nodes with degree k. In words: C(k) is the mean cluster coefficient over all nodes with degree k.

(Note that the cluster function leads to a distribution.)

A high C (in comparison to a random graph of the same size) indicates modularity.

[Table: properties of real networks, from Albert & Barabasi, 02.]

Notice:

Cluster coefficients for these graphs are much higher than for equivalent random graphs. Indicates modularity?

Most of them display the small world property. In some cases the average path length may be longer than in a random network, but dense random networks have the small world property anyway.

The power grid graph has much longer paths than the equivalent random graph. Why?

The exponents of the power law tail seem to vary between 1 and 3.

[Figure: three panels of log P(k) against log k, labelled 1, 2 and 3.]

The lower lambda, the larger the number of highly connected nodes and the larger the range of degrees.

Some interesting facts

If the Cluster Function follows a power law (i.e. the cluster function C(k) falls with $k^{-\lambda}$ for some lambda) then this is evidence for a hierarchical modular structure.

Highly connected nodes are called hubs. The power law exponent reveals something about the importance of hubs in a given network.

If lambda > 3, the tail is short and hubs are few and not very heavily connected.

Lambda between 2 and 3 suggests a hierarchy of hubs, with the most heavily connected hub being connected to a relatively small fraction of the other nodes, but many of these will be hubs themselves.

Lambda <= 2 suggests hubs that connect to large fractions of the nodes, acting like control centres.

Scale-free networks in general are robust to damage; however, the presence of hubs (especially when lambda is …?) suggests vulnerability.

Assignment 2

Read Paper and provide a 4-slide presentation that conveys the main points to busy scientists.
Web Intelligence people: Paper = Diameter of the WWW
OAB people: Paper = Graph theoretic analysis of protein interaction networks of eukaryotes

Marking: Completeness (1), Presentation (2), Brevity (1), Wow (1)