e-Business Value Chain & Business Model

Injun Choi (최인준)
injun@postech.ac.kr
http://dke.postech.ac.kr
POSTECH I.E. Data & Knowledge Engineering Lab. (Department of Industrial Engineering, Pohang University of Science and Technology)

Commerce Process 1
• Market/storefront formation: contact between buyers and sellers
  – Communication: advertising and marketing
  – Intermediaries (dealers, distributors, reps)
• Bargaining/negotiation
  – Agents (buyers, lawyers)
• Order/contract (transaction)
  – Order or contract
  – Payment (or many payments)

Commerce Process 2
• Order fulfillment
  – Manufacture
  – Delivery
• Post-sale events
  – Customer service
  – Reorder, restock
• Accounting
  – Are we making a profit?
• Data analysis
  – Who's buying the stuff?

Opportunities and Technologies for Electronic Commerce 1
• Which parts can be made electronic?
• Market/storefront formation
  – Communication (via networks, the Internet, programming and information retrieval)
  – Human-computer interaction, multimedia
  – Intermediaries (new intermediaries)
    • Disintermediation
• Bargaining/negotiation
  – Electronic negotiation, intelligent agents

Opportunities and Technologies for Electronic Commerce 2
• Order/contract
  – Transaction processing, databases & ERP
  – Electronic payment systems
  – Computer security
  – System reliability
• Order fulfillment
  – Manufacture (manufacturing systems)
  – Delivery (tracking systems)

Opportunities and Technologies for Electronic Commerce 3
• Post-sale events
  – Customer service and help facilities
  – Reorder, restock
• Accounting
  – Transaction processing
  – Interoperability between online and legacy systems
• Analysis
  – Data mining
e-Business Processes, Technologies, and Information
[Figure: the e-business process, with the technologies used and the information gathered along it]
• Process stages: buyer locates goods, selection of goods, negotiation, sale, payment, delivery, post-sale service
• Some technologies used: search engine, on-line catalog, recommender agent, configurator, shopping bot, aggregator, automated agents, transaction processor, data interchange, cryptography, e-payment systems, tracking agent, on-line help, browser sharing, Internet telephony
• Some information gathered: search behavior, browsing behavior, customer preferences, effectiveness of promotions, bargaining strategies, price sensitivities, personal data, market basket, credit/payment information, delivery requirements, on-line problem reports, customer satisfaction, follow-on sales opportunities

e-Business & Value Chain (Oracle)
[Figure: buy-side and sell-side systems mapped onto the value chain]
• Trading parties: suppliers, employees, resellers, customers
• Buy side: electronic procurement system, collaborative supply chain management
• Sell side: customer relationship management, on-line shopping system
• Back office / ERP application
• Aims of Internet e-commerce: market expansion, customer satisfaction, competitive advantage, cost efficiency, process improvement, logistics optimization, cost and time savings, revenue growth
• Value chain activities: demand creation and marketing; service management; planning/development (products/services); purchasing/procurement (production goods/indirect goods/services); production/manufacturing (products/services); distribution and inventory management; sales and order management

e-Business and Value Chain (POEM)
[Figure: value chain activities, trading parties, and supporting information systems connected through the Internet]
• Trading parties: customers, resellers, competitors, banks, suppliers, employees
• Value chain activities: purchasing/procurement (production goods/indirect goods/services); production/manufacturing (products/services); demand creation and marketing; planning/development (products/services); sales and order management; payment; distribution and inventory management; service management
• Information systems: CRM, on-line shopping, e-payment, SCM, e-Mfg., e-logistics, call center/QoS, ERP

e-Business Value Chain Techniques and Technologies
• Marketing: DB marketing, 1:1 marketing, CRM & eCRM
• Sales: on-line catalog, e-shopping cart, …
• Payment: e-payment system (e-cash, e-check, smartcard, …)
• Procurement & purchasing: SCM, marketplace, …
• Manufacturing & management: ERP, e-manufacturing, …
• Logistics & CS/QM: e-logistics, call center, QoS, 6-sigma, …
e-Business Foundation and Component Technologies
• Internet
• Mobile technologies
• Web architecture
• Data interchange
• Multimedia
• Access security
• Cryptographic security
• Search engines
• Data mining
• Intelligent agents
• Databases
• Workflow management
• Knowledge management

3.4 e-Business Solutions
• Enable good managers and business owners to build, manage and maintain an e-business
• Web-site building services
• e-Consulting
• Marketing

3.4.1 End-to-End e-Business Solutions
• End-to-end solution provider
  – Offers services to build Web sites from conception to implementation
    • Design, development and deployment services
    • Payment capabilities
    • Web-site monitoring services
    • Back-end adaptation
    • Fulfillment
    • Data management

3.4.1 End-to-End e-Business Solutions (continued)
• End-to-end solution providers
  – Webvision
  – Microsoft's bCentral
  – ROIDirect's Ecommerce
  – Dell E Works
  – Genuity
  – Interland
  – Appnet
  – Sapient
  – Scient
  – Viant
  – Proxicom

3.4.2 Other e-Business Solutions
• Exist for e-business development, operation and management
• Solution providers
  – Openair.com
  – Intacct
  – BAport Accounting
  – Netledger
  – BizTone Financials
  – Allaire Spectra
  – Mediasurface
  – InfoOffice SM

3.4.3 Maintaining and Monitoring Your Web Site
• Balanced Scorecard
  – A method used to measure the success of a business by its performance in customer satisfaction, integration capabilities and potential for growth
  – An e-business must also consider its use of current technologies for management and production purposes
• Monitoring software and services
  – Mercury Interactive, ebSure, Inc., Akamai, iSharp.com, Holistix, Keynote.com, Site Rock, Red Alert

3.4.3 Maintaining and Monitoring Your Web Site (continued)
[Figure: Holistix's Web Manager.]
(Courtesy of Holistix, Inc.)

3.4.4 e-Commerce Consulting
• Guide developing e-businesses
• Consulting services
  – Accenture (formerly Andersen Consulting)
  – iPlanet
  – SAP
  – Sun Microsystems
  – Kintana
  – Xpedior
  – Ernst & Young
  – Deloitte & Touche
  – eRunway

Data Warehousing, OLAP, and Data Mining
Injun Choi

Definitions
• Data warehousing
  – The whole process from collecting and processing data to using the information derived from it
• Data warehouse
  – A read-only analytical database used as the foundation of a decision-support system
  – A data repository that aims at the time-series accumulation and integration of data in order to support efficient decision making from an enterprise-wide strategic perspective
• OLAP (On-Line Analytical Processing)
  – The process of extracting information through multidimensional analysis of data
  – A hypothesis-verification style of querying, like conventional SQL
• Data mining
  – The process of extracting meaningful, well-founded information that was previously unrecognized from large databases and using it for decision making
  – A hypothesis-discovery approach, unlike the hypothesis-verification style of conventional querying (SQL, OLAP)

Data Warehouse Architecture
[Figure: data warehouse architecture. Components shown: operational & external data; data extract, data cleanup, data load; data warehouse DBMS; metadata; repository; MRDB and MDDB; data marts; report, query and EIS tools; OLAP tools; data mining tools; applications & tools; information delivery system; management platform; admin platform]

Conventional Hypothesis-Verification Querying
[Figure: a hypothesis is checked against the data with a query tool, a visualization tool, or an OLAP tool]
• Hypothesis: "People who live in large apartments and have a high monthly income probably own more air conditioners."
• Query: "Show average monthly income and air-conditioner ownership by region."

Example of OLAP Analysis
[Figure: example of OLAP analysis]

Hypothesis-Discovery Data Mining
[Figure: data mining derives hypotheses from the data; the hypotheses are then verified against the data to yield information]
• Request: "Show the characteristics of the customers who bought product A."
• Discovered hypotheses:
  – Men in their 40s with an income of 2 million won or more, living in region R.
  – Income of 2 million won or more, working in occupation C.

Background
• Corporate data has grown rapidly over the last 15-20 years. The use of bar codes on all products, increasing use of credit cards, and mail order shopping are partly responsible for this growth.
• This data usually resides in transaction databases and is difficult to use to support the increasing need for sophisticated decision making.

Background (Continued)
• The need for analyzing and synthesizing information is growing in today's fiercely competitive business environment.
• Most enterprise databases were designed in the 1970s or 80s, mainly to automate office procedures such as order entry, student enrolment, and patient registration. These were well-structured, repetitive operations that were easily automated.

Background (Continued)
• The clerical view of data focuses on the details required for the day-to-day running of an enterprise, while the management view of data focuses on summary data in order to identify trends, challenges and opportunities.
• The detailed data view may be called the operational view, while the management view may be called the decision-support view.

Background (Continued)
Operational                 Decision-support
--------------------------------------------------------------------
Users - Admin staff         Users - Management
Day-to-day work             Decision support
Application oriented        Subject oriented
Current data                Historical data
Detailed                    Overall view - summaries or group by
Simple queries              Complex queries
Predetermined queries       Ad hoc queries
Update/Select               Only Select
Real-time                   Not real-time

Background (Continued)
• Modern enterprises have many sophisticated applications, for example:
  – sales forecasting and analysis
  – marketing and promotion planning
  – business modeling
• The design of conventional database systems did not take into account the requirements of such management applications.
Background (Continued)
• The relational database systems that most enterprises use at present are limited in the functionality they provide to support user views of data.
• The systems lack strong manipulative capabilities and therefore lack the ability to consolidate, view, and analyze data according to multiple dimensions.

Background (Continued)
• What a modern enterprise needs, therefore, is to be able to use the data in conventional enterprise systems to assist management decision making.
• The aim of data warehousing, OLAP, and data mining is to provide management decision support using historical, summarized and consolidated data rather than operational data.

Background (Continued)
• It is natural to think of enterprise data as multidimensional. For example, a university may think of its student data as three-dimensional: year of admission × country × degree. The university may be interested in retrieving information like:
  – How many students are doing BIT? How many students are from Thailand? How many students started in 1998? (queries involving only one variable)
  – How many students doing BIT are from Thailand? How many MIT students started in 1998? How many students from Thailand started in 1998? (queries involving two variables)
  – How many students doing MIT from Thailand started in 1998? (query involving all three variables)
• There will be many more such queries if the number of variables is larger than three.

Background (Continued)
• All the above queries may be represented by a three-dimensional data cube, with each edge representing one of the variables, viz. year, country, and degree.
• A point inside the cube is an intersection of the coordinates defined by the edges of the cube. The coordinates of the point define the meaning of the data at that point.
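The university example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the six student records, the `*` wildcard, and the helper names `build_cube` and `count` are hypothetical and do not come from the slides. Each cell of the cube holds a count, and a query fixes one, two, or all three coordinates.

```python
from collections import Counter
from itertools import product

# Hypothetical student records: (year of admission, country, degree).
students = [
    (1998, "Thailand", "BIT"),
    (1998, "Thailand", "MIT"),
    (1998, "Malaysia", "BCom"),
    (1999, "Thailand", "BIT"),
    (1999, "Malaysia", "MIT"),
    (1998, "Malaysia", "BIT"),
]

ALL = "*"  # wildcard meaning "aggregate over this dimension"


def build_cube(records):
    """Pre-compute counts for every mix of fixed and wildcard coordinates."""
    cube = Counter()
    for record in records:
        # A record contributes to 2**3 = 8 cells: every way of replacing
        # some of its coordinates with the wildcard.
        for mask in product((False, True), repeat=len(record)):
            cell = tuple(ALL if hide else value
                         for value, hide in zip(record, mask))
            cube[cell] += 1
    return cube


def count(cube, year=ALL, country=ALL, degree=ALL):
    """Look up one cell; dimensions left unspecified are aggregated."""
    return cube[(year, country, degree)]


cube = build_cube(students)
print(count(cube, degree="BIT"))                                # one variable
print(count(cube, country="Thailand", degree="BIT"))            # two variables
print(count(cube, year=1998, country="Thailand", degree="MIT")) # three variables
```

Storing the wildcard cells up front trades space for query speed, which is the same trade-off a warehouse makes when it keeps pre-computed aggregates.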
Slicing a data cube
[Figure: slicing a data cube]

Data Cube
• Each edge of the cube is called a dimension. A user normally has a number of different dimensions from which the given data may be analyzed. A user therefore has a multidimensional conceptual view of the data, which is represented by the cube.
• The points inside a cube provide aggregations. For example, a point may provide the number of students from Malaysia admitted to BCom in 1998.

Multidimensional View
• A particular user will have one multidimensional view of the database while another user in the same enterprise may have another view. Therefore many different multidimensional views of the same database are possible, and the same data may be consolidated in many different ways.

Benefits of Multidimensional Analysis
• A small high-level database with pre-computed aggregates is created for efficient high-level queries
• Multiple-level views
• Selection by slicing and dicing
• However, multidimensional analysis does not provide data mining

OLAP
• These operations are often called On-line Analytical Processing or OLAP. Codd defines OLAP as the dynamic enterprise analysis required to create, manipulate, animate, and synthesize information from exegetical, contemplative, and formulaic data analysis models. OLAP deals only with historical data accurate at a given point in time.
OLAP Characteristics
• Codd lists the following characteristics of OLAP:
• Dynamic data analysis - involving historical data of multiple dimensions manipulated in many different ways with the aim of studying changes occurring in the enterprise
• Common enterprise data - OLAP uses the enterprise data, but in a very different way, to discover why particular situations occurred
• Synergistic implementation - data synthesis, analysis, and consolidation

OLAP Characteristics
• Four enterprise data models
• Categorical - comparison of historical values
• Exegetical - discovering reasons for what the categorical model found
• Contemplative - "what if" analysis of the data
• Formulaic - how to reach a desired goal

OLAP
• Generally, OLTP systems are designed to handle short transactions efficiently. They are not designed to handle OLAP queries, which are usually complex and would degrade the operations of an OLTP system. Furthermore, OLTP systems normally hold only current data while OLAP often requires historical consolidated data.

Data Warehousing
• A definition: data warehousing is a process, not a product, for assembling and managing data from various sources for the purpose of gaining a single detailed view of part or all of a business
• Transaction records stored in most databases are not suitable for data warehousing because the records tend to be dynamic, incomplete, and not of high quality.

Data Warehousing
• Most database systems are changing all the time. A data warehouse is maintained separately and does not represent a snapshot of the operational database.
• Most database systems are error-prone. A data warehouse should have as few errors as possible.
• Most database systems continue to grow, but a data warehouse should grow at a slower rate
Need for Data Warehousing
• Integrated, company-wide view of high-quality information.
• Separation of operational and informational systems and data.

Examples of heterogeneous data
[Figure: examples of heterogeneous data]

Data Warehousing
• A data warehouse contains information collected from multiple, independent data sources and integrated into a common repository for querying and analysis.
• Often data warehouses are designed for on-line analytical processing (OLAP), where the queries aggregate large volumes of data in order to detect trends and anomalies.

Data Warehousing
• To speed up OLAP queries, a warehouse contains summarized and consolidated information representing materialized aggregate views of the enterprise data from a number of databases.
• Data warehouse and OLAP are complementary. A warehouse stores data while OLAP derives strategic information from it.

Data Warehousing
• A warehouse usually contains information over time, which helps the analysis of trends
• A data warehouse repackages information to support business decision making
• A data warehouse has two components: metadata management and warehouse administration
• The aim in data warehousing may be to generate new revenue by selling the repackaged information

Generic data warehouse architecture
[Figure: generic data warehouse architecture]

Three-layer architecture
[Figure: three-layer architecture]

Three-layer data architecture
[Figure: three-layer data architecture]

Data Warehousing Process
• Extraction - data relevant to the tasks is selected and retrieved from a variety of sources.
• Transformation - data is consolidated by summarization or aggregation.
• Cleansing - since data comes from a number of sources, errors and anomalies are common.
  There is a need to remove anomalies and errors and to handle missing and irrelevant data. Some tools are available for doing this.

Steps in data reconciliation
[Figure: steps in data reconciliation]

Data Cleaning
• Data cleaning overcomes problems like the following:
  • Inconsistent field lengths
  • Missing entries
  • Violation of integrity constraints
  • Inconsistent values
• Data cleaning can be a very demanding task

Data Warehouse Process
• Design - The E-R diagram approach is not suitable for designing a schema for a warehouse. One approach is the star schema, which represents the multidimensional data model. The schema in this model consists of a single fact table and a single table for each dimension. Other models have been used.

Data Warehouse Process
• Integration - combining data from many, perhaps heterogeneous, sources. This is a non-trivial task since different sources will use different formats, field lengths, codes, and descriptions for the same data items.
• Loading - before loading, additional processing may be needed, e.g., checking integrity constraints and building derived tables, indices, and access paths.
• Refresh - warehouse data needs to be periodically updated as the operational data changes. The update could happen daily, weekly or even monthly. Also, the updates need to be logically correct since the warehouse data is derived data.
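The star schema mentioned under Design (one fact table plus one table per dimension) can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the schema from the slides: the table names, column names, and row values below are all hypothetical, chosen to echo the university example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per dimension, and a single fact table that references them.
# All names and values are hypothetical.
conn.executescript("""
CREATE TABLE dim_period  (period_id  INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_degree  (degree_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_admissions (
    period_id  INTEGER REFERENCES dim_period(period_id),
    country_id INTEGER REFERENCES dim_country(country_id),
    degree_id  INTEGER REFERENCES dim_degree(degree_id),
    students   INTEGER   -- the measure stored at this grain
);
""")

conn.executemany("INSERT INTO dim_period VALUES (?, ?)",
                 [(1, 1998), (2, 1999)])
conn.executemany("INSERT INTO dim_country VALUES (?, ?)",
                 [(1, "Thailand"), (2, "Malaysia")])
conn.executemany("INSERT INTO dim_degree VALUES (?, ?)",
                 [(1, "BIT"), (2, "MIT")])
conn.executemany("INSERT INTO fact_admissions VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 40), (1, 1, 2, 15), (1, 2, 1, 25), (2, 1, 1, 30)])

# A typical star join: constrain one dimension, group by another,
# and aggregate the measure in the fact table.
rows = conn.execute("""
    SELECT c.name, SUM(f.students)
    FROM fact_admissions AS f
    JOIN dim_country AS c ON c.country_id = f.country_id
    JOIN dim_period  AS p ON p.period_id  = f.period_id
    WHERE p.year = 1998
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

Queries against such a schema follow the same pattern each time: join the fact table to the dimensions of interest, filter on dimension attributes, and aggregate the measure.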
Data Warehouse Process
• Chaudhuri and Dayal present the following process:
• Define the architecture, do capacity planning, select hardware and software
• Integrate hardware and software
• Design the warehouse schema and the views
• Design the physical data structures
• Design data extraction, cleaning, transformation, load and refresh software
• Populate the repository with data and software
• Design and implement end-user applications

Star Schema
• Also called the dimensional model.
• Fact and dimension tables.
• Grain of a fact table - time period for each record.

Components of a star schema
[Figure: components of a star schema]

Star schema example
[Figure: star schema example]

Star schema with sample data
[Figure: star schema with sample data]

Star schema with two fact tables
[Figure: star schema with two fact tables]

Example of a snowflake schema
[Figure: example of a snowflake schema]

Data Mining
• The efficient automated discovery of previously unknown patterns in large volumes of data
• Businesses are mostly interested in discovering past patterns to predict future behaviour
• A typical data mining question might be "what are the characteristics of students from the state of Texas?", with the aim of increasing the number of students from Texas.

Data Mining (3)
• We assume we are dealing with large data, perhaps gigabytes or more
• Although data mining is possible with smaller amounts of data, the bigger the data, the larger the chance of discovering something unknown
• There is considerable hype about data mining at the present time, and Gartner Group has listed it as one of the top ten technologies to watch.
Data Mining (4)
• Data mining includes a large number of techniques, including decision trees, neural networks, nearest neighbour, genetic algorithms, statistics, etc.
• Expression and visualisation of data mining results is a challenging task
• Privacy issues also need to be considered.

Why Data Mining?
• Accumulation of large amounts of data
• Increased computing power enabling data mining processing
• Statistical and learning algorithms

Data Mining Applications
• Applications in financial, telecom, insurance and retail companies for
  – market segmentation
  – fraud detection
  – better marketing
  – trend analysis
  – market basket analysis
  – customer churn

Data Mining (5)
• Data mining is related to
  – data warehousing
  – on-line analytical processing (OLAP)
  – data visualization
• Data mining needs a data warehouse for effective mining. The aims of OLAP and data mining are similar, but only data mining involves looking for unknown patterns. Finally, data mining requires data visualization for the presentation of results.

Data Mining Products
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• many others

Data Mining Tasks
• Class description
• Association
• Sequential patterns
• Time-series analysis
• Prediction
• Classification
• Clustering

Types of Information and Techniques (Data Mining)
• Association rules: dependencies among items that exist in the data
• Sequence rules: a kind of association rule; dependencies among items that include the passage of time
• Classification rules: characteristics of records that distinguish the classes from one another
• Data clustering: dividing the data into a few subgroups with similar characteristics
• Main techniques: co-occurrence matrix, decision trees, traditional statistics, neural networks, K-means clustering

Association Rules
POS data (transaction no.: product name (product code), …)
  1: necktie (NT011), shirt (ST001), shirt (ST350)
  2: suit (FS123), belt (BT432), coat (CT005), shirt (ST001)
  3: belt (BT432), necktie (NT011), shirt (ST001)
  4: socks (SK100), belt (BT432), coat (CT005), shirt (ST350), suit (FS123)
  :
Process → association rules
• necktie (NT011) → shirt (ST001)
• suit (FS123) & belt (BT432) → coat (CT005)

Sequence Rules
POS data (member no., transaction date, items purchased), for members 1 to 5, in the order shown on the slide:
  99-02-01: B, C
  99-02-05: A
  99-02-19: D, E, H
  99-02-07: A
  99-02-10: H
  99-02-12: G
  99-02-20: A, C, D
  99-02-23: F
  99-02-08: A, C
  99-02-18: B, H
  99-02-21: A
Process → sequence rule
• A member who has bought item A has a 75% chance of buying item H later.

Classification Rules
Data: each record has attributes (occupation, sex, district, age) and a class value (the target variable: response).
No.  Occupation     Sex  District  Age  Response
-------------------------------------------------
1    unemployed     M    Gangbuk   35   no
2    unemployed     F    Gangbuk   51   no
3    self-employed  F    Gangbuk   31   yes
4    employed       M    Gangbuk   38   yes
5    employed       F    Gangnam   33   yes
6    employed       M    Gangnam   54   no
7    self-employed  F    Gangnam   49   yes
8    unemployed     F    Gangbuk   32   no
9    unemployed     M    Gangnam   32   yes
10   employed       M    Gangnam   35   yes
11   unemployed     F    Gangnam   54   yes
12   self-employed  M    Gangbuk   50   yes
13   self-employed  F    Gangnam   36   yes
14   employed       M    Gangbuk   49   no
Process → classification rule
Characteristics of the class that answered "yes":
• occupation = "self-employed", or
• occupation = "employed" & age ≤ 43, or
• occupation = "unemployed" & district = "Gangnam"

Data Clustering
[Figure: customer data → classification rule search → customer segmentation into clusters A to F → identification of the characteristics of each cluster]
Examples of the characteristics of customers in cluster B:
• Income of 2 million won or more, no children, in their 30s.
• Highly educated, children all married and living elsewhere, average annual purchases of about 2 to 3 million won.

Class Description
• Summarization of a collection of data is called class description.
• A class description may be used to compare, for example, undergraduate and postgraduate students.
• Class description provides summary properties as well as the variance of the property values.

Associations
• Given a set of transactions, each containing a subset of items from an item set, association mining is the discovery of association relationships or correlations among a set of items
• Example: discovering that personal loans are repaid with 80% confidence when the person owns their home
• The classical example is the one where a store discovered that people buying nappies tend also to buy beer
Associations
• Association rules are often written as X => Y, meaning that whenever X appears Y also tends to appear.
• A supermarket such as Woolworths may have several thousand items and many millions of transactions a week (which could be several gigabytes of data each week).
• Note that the quantities of items bought in a transaction are ignored.

Sequential Patterns
• A set of objects with timestamps is given. The objective is to find patterns like the following:
• If the BHP stock has risen for three consecutive days and the rise is more than 5%, then on the fourth day there is an 80% chance of the stock going down
• If the All Ordinaries Index has risen for two consecutive days and the BHP stock has gone down on the same two consecutive days, then there is a 70% chance that the BHP stock will rise

Time-Series Analysis
• Data mining in large sets of time-series data aims to find interesting patterns, e.g. similar sequences or subsequences.
• A common application of such techniques is the analysis of stock market data in an attempt to find patterns.

Prediction
• Many data mining problems may be considered as attempts to make predictions based on prior samples
• Prediction may be based on finding trends, sequential patterns, etc.

Classification
• A set of training objects, each with a number of attribute values, is given to the classifier.
• The classifier formulates rules for each class in the training set so that the rules may be used to classify new objects.
• Some classification techniques do not require training data.
• The decision tree approach appears to be commonly used in data mining.
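As an illustration of rules being used to classify objects, the sketch below applies the three-part classification rule from the earlier survey-response slide to its 14 records and checks how many it gets right. It shows only the application of an already formulated rule, not the learning step. Two assumptions are made: the Korean attribute values are rendered in English, and the age test is taken to be "≤ 43" because the comparison operator did not survive in the extracted slide text (that reading agrees with all 14 records). The function name `predict` is hypothetical.

```python
# (occupation, sex, district, age, response) for records 1 to 14 of the
# survey-response slide; values translated from Korean.
records = [
    ("unemployed",    "M", "Gangbuk", 35, "no"),
    ("unemployed",    "F", "Gangbuk", 51, "no"),
    ("self-employed", "F", "Gangbuk", 31, "yes"),
    ("employed",      "M", "Gangbuk", 38, "yes"),
    ("employed",      "F", "Gangnam", 33, "yes"),
    ("employed",      "M", "Gangnam", 54, "no"),
    ("self-employed", "F", "Gangnam", 49, "yes"),
    ("unemployed",    "F", "Gangbuk", 32, "no"),
    ("unemployed",    "M", "Gangnam", 32, "yes"),
    ("employed",      "M", "Gangnam", 35, "yes"),
    ("unemployed",    "F", "Gangnam", 54, "yes"),
    ("self-employed", "M", "Gangbuk", 50, "yes"),
    ("self-employed", "F", "Gangnam", 36, "yes"),
    ("employed",      "M", "Gangbuk", 49, "no"),
]


def predict(occupation, district, age):
    """Apply the slide's rule for the 'yes' class; everything else is 'no'."""
    if occupation == "self-employed":
        return "yes"
    if occupation == "employed" and age <= 43:  # "<=" is an assumed operator
        return "yes"
    if occupation == "unemployed" and district == "Gangnam":
        return "yes"
    return "no"


correct = sum(predict(occ, dist, age) == resp
              for occ, _sex, dist, age, resp in records)
accuracy = correct / len(records)
print(f"{correct} of {len(records)} records classified correctly")
```

Note that sex never appears in the rule: a classifier is free to ignore attributes that do not help separate the classes.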
Clustering
• Clustering is similar to classification in that its aim is to build clusters such that each cluster is similar within itself but dissimilar to other clusters.
• The principle used, therefore, is to maximize the intracluster similarity and minimize the intercluster similarity.

Associations
• To discover associations, we assume that we have a set of transactions, each transaction being a list of items (e.g. a list of books)
• Suppose A and B appear together in only 1% of the transactions, but whenever A appears there is an 80% chance that B also appears
• The 1% presence of A and B together is called the support of the rule, and 80% is called the confidence of the rule

Associations
• A user might be interested in finding all associations which have x% support with y% confidence such that
  – all associations satisfying user constraints are found
  – associations are found efficiently from large databases
• Confidence denotes the strength of the association.
• Support indicates the frequency of the pattern.
• A minimum support is necessary if an association is going to be of some business value.

The Apriori Algorithm
• To find such associations, a simple two-step approach may be used:
• Step 1 - discover all frequent item sets, i.e. those with support above the minimum support required
• Step 2 - use the frequent item sets to generate the association rules that have a high enough confidence level

The Apriori Algorithm
• Scan all transactions and find all items that have transaction support above x%. Let these be L1.
• Build item pairs from L1. This is the candidate set C2.
• Scan all transactions and find all frequent pairs in C2. Let this be L2.
• General rule - build sets of k items from Lk-1. This is set Ck.
• Scan all transactions and find all frequent sets in Ck. Let this be Lk.
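The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the function names are hypothetical, the transactions and the 30% support / 60% confidence thresholds are the ones used in this deck's worked example, and the subset-pruning step (discard a candidate if any of its (k-1)-item subsets is not frequent) is the standard Apriori refinement, added here as an assumption since the slides do not spell it out.

```python
from itertools import combinations

# Transactions and thresholds from the deck's worked example.
transactions = {
    "001": {"B", "M", "T", "Y"},
    "002": {"B", "M"},
    "003": {"T", "S", "P"},
    "004": {"A", "B", "C", "D"},
    "005": {"A", "B"},
    "006": {"T", "Y", "E"},
    "007": {"A", "B", "M"},
}
MIN_SUPPORT = 0.30
MIN_CONFIDENCE = 0.60


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def apriori():
    """Return {frequent itemset: support count}, found level by level."""
    n = len(transactions)
    all_items = sorted(set().union(*transactions.values()))
    level = [frozenset([item]) for item in all_items]  # candidates C1
    frequent = {}
    k = 1
    while level:
        counts = {c: support_count(c) for c in level}          # scan for Ck
        current = {c: cnt for c, cnt in counts.items()
                   if cnt / n >= MIN_SUPPORT}                  # this is Lk
        frequent.update(current)
        k += 1
        # Join step: build k-item candidates from pairs of frequent sets.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        level = [c for c in joined
                 if all(frozenset(s) in current
                        for s in combinations(c, k - 1))]
    return frequent


def rules(frequent):
    """Generate rules lhs => rhs whose confidence meets the threshold."""
    found = []
    for itemset, cnt in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = cnt / frequent[lhs]
                if confidence >= MIN_CONFIDENCE:
                    found.append((tuple(sorted(lhs)),
                                  tuple(sorted(itemset - lhs)),
                                  confidence))
    return sorted(found)


frequent = apriori()
for lhs, rhs, confidence in rules(frequent):
    print(lhs, "=>", rhs, f"{confidence:.0%}")
```

On these seven transactions the sketch reports A => B and M => B at 100% confidence and B => A and B => M at 60%, and it finds no frequent set of three items.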
The Apriori Algorithm
• Consider an example with the following set of transactions:
  TID   Items bought
  -------------------------
  001   B, M, T, Y
  002   B, M
  003   T, S, P
  004   A, B, C, D
  005   A, B
  006   T, Y, E
  007   A, B, M
• Assume that we wish to find associations with at least 30% support and 60% confidence.

The Apriori Algorithm
• The list of frequent items is now computed. With seven transactions, an item needs to appear in at least three of them to exceed 30% support. Four items qualify. This is set L1.
  Item   Frequency
  ------------------------
  A      3
  B      5
  M      3
  T      3
• These four items form six pairs: {A, B}, {A, M}, {A, T}, {B, M}, {B, T}, and {M, T}. This set is C2. Now find the frequency of these pairs.

The Apriori Algorithm
• The frequency of the pairs is
  Pair     Frequency
  ------------------------
  {A, B}   3
  {A, M}   1
  {A, T}   0
  {B, M}   3
  {B, T}   1
  {M, T}   1
• Only {A, B} and {B, M} have more than 30% support. What about their confidence level? A --> B has a confidence level of 100%, B --> A 60%, B --> M 60%, and M --> B 100%. All are therefore acceptable.

The Apriori Algorithm
• The frequent item pairs (that is, L2) are:
  Pair     Frequency
  ------------------------
  {A, B}   3
  {B, M}   3
• These pairs are now used to generate sets of three items (i.e. C3). In this simple example only one such set is possible, which is {A, B, M}. The frequency of this set is only 1, which is below 30% support, and therefore this set of three items does not qualify.

The Apriori Algorithm
• The algorithm used to construct the candidate sets for large item sets is crucial to the performance of the Apriori algorithm.
• The larger the candidate set, the higher the processing cost of discovering the large item sets.
• Given that the early candidate sets are very large, the initial iterations dominate the cost.
• It is the generation of the large 2-item sets that is the key to improving the performance of the algorithm.

Variety of Algorithms
• More than one algorithm is often used, since one algorithm may discover something that others do not
• Algorithms include neural networks, induction, association, fuzzy logic, statistical methods, and visualization
• Predictive modeling, database segmentation, link analysis, and deviation detection

Why Is Data Mining Not Being Used More?
• Data mining is very resource intensive; it can take years for results to be achieved
• Some data mining software and some experts rely on statistical expertise that the user does not have

References
• E. F. Codd, S. B. Codd, and C. T. Salley, Providing OLAP to User-Analysts: An IT Mandate, available from http://www.arborsoft.com/OLAP.html
• W. H. Inmon, Building the Data Warehouse, John Wiley, 1992.
• S. Chaudhuri and U. Dayal, An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 26(1), pp. 65-74, 1997.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
• M. S. Chen, J. Han, and P. S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 866-883, 1996.
• M. J. A. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support, Wiley, 1997.
