EUROPEAN COMMISSION INFORMATION SOCIETY TECHNOLOGIES (IST) PROGRAMME Deliverable 2.1 Conceptual, logical and physical model of datamarts Work Package No. 2: Concept of IT framework issues and development of software for the creation of a multi-source data base Alberto Sorce Filippo Oropallo ISTAT - National Statistical Institute of Italy ISTAT – National Statistical Institute of Italy Information System Economic Research and Microsimulation Via Balbo, 16 - 00184 ROMA Via Torino, 6 - 00184 ROMA Tel: +39 06 4673 2575 Tel: +39 06 4673 4186 E-Mail: firstname.lastname@example.org E-Mail: email@example.com ISTAT - September 2002 25 / Deliverable 2.1 1 IST Programme, contract No 2000-31125 Development of a System of Indicators on Competitiveness and Fiscal Impact on Enterprises Performance (DIECOFIS) PROJECT PARTNERS Coordinator Istituto Nazionale di Statistica (I) Principal Contractors Board of Inland Revenue (UK) London School of Economics (UK) University of Cambridge (UK) Università di Roma Tor Vergata (I) Università di Firenze (I) Centro di Ricerca Economica e Sociale (I) European Commission Joint Research Centre (I) Informer SA Computing Information Systems and Management Cons. (EL) Wirtschaftsuniversität Wien (A) 25 / Deliverable 2.1 2 IST Programme, contract No 2000-31125 Development of a System of Indicators on Competitiveness and Fiscal Impact on Enterprises Performance (DIECOFIS) SUMMARY OF DELIVERABLE The development of the integrated database was concerned with the study of the architecture of the information system. The logical and physical schemes have been produced. In these, substantially, we can distinguish three main groups of tables: - Enterprise life (demography, activity, size) - Nomenclatures and definitions to harmonize different sources (Metadata) - Tax oriented tables containing account, production and other variables needed by the modules of the microsimulation model for the estimation of tax burden and for the construction of indicators. The construction of indicators will be possible with the implementation of Data marts for specific analysis, where granular data are prepared with aggregations by size, or geographical area, economic activity etc. 25 / Deliverable 2.1 3 IST Programme, contract No 2000-31125 Development of a System of Indicators on Competitiveness and Fiscal Impact on Enterprises Performance (DIECOFIS) CONTENTS OF DELIVERABLE Introduction ............................................................................... 4 ISTAT experiences ...................................................................... 5 Diecofis Database ....................................................................... 6 Data model ................................................................................ 7 The identifying variables .............................................................. 8 The analysis of the Logical model ............................................... 10 Logical Model ........................................................................... 12 Physical Model.......................................................................... 16 Data marts for specific analysis .................................................. 20 Select Bibliography ................................................................... 23 25 / Deliverable 2.1 4 Introduction Workpackage 2 is concerned with the development of the integrated database and with the building of several collections of datasets to serve specific micro-analysis. Integration will take place by means of the development of software containing matching techniques of different sources of data and re-weighting units of firms’ samples. A software user manual will guide in the integrating process. Objectives of WP2: • Creation of a multi-source integrated and systematised database • Collection of datasets for micro-analysis Work of WP2: • Development of the logical and physical scheme of the data base • Analysis of the software for the integration and the collection of datasets • Code for data downloading and user interface • Software user manual Deliverable: The deliverable contains: • A description of the ISTAT experience concerning integrated databases. • Features of Diecofis Database, with a look at the identifying variables. • Logical and Physical schemes. • Data Marts for specific analysis. 25 / Deliverable 2.1 5 ISTAT experiences In the last years the Italian NSI has started several studies and projects about integration of data from different sources. One project concern the development of the business register ―ASIA‖. This archive is the result of an integration process of different administrative sources. It leads to the creation of the statistical register of active enterprises. Basically activity is presumed by means of a logistic model where the probability of existence is a function of various signs of life drawn from several administrative sources (Revenue Agency, Chamber of Commerce, Social Security Institute, Insurance Institute at work, yellow pages). Other projects are attempting the integration of the business register, which identify the business unit, with other administrative data. In the database Coe are integrated information about foreign trade. Other projects are attempting the integration of business register with accounts data from Chambers of Commerce. An important activity in this direction is the development of the Statistical Information System on Enterprises and Institutions. The part concerning with enterprises would integrate all business statistics for the management of business units and for data analysis. The system is based on database network which integrate business surveys of the Department of Economic Statistics. This integrated system would also bring to systematise all information and to the development of a modern system of dissemination. IT developments, during recent years, permits to develop an unique environment and to build Data Warehouses for the dissemination of large databases, which can fulfil different users’ needs. 25 / Deliverable 2.1 6 Diecofis Database The study of Istat surveys of enterprises within Work Package 1 leads to plan a data programme. The work deals with the creation of a database in order to gather information deriving from different statistical surveys. The idea comes out from the need of having an on-line source able to provide as many information as possible on structure and activity of Italian enterprises. The project’s first stage deals with the storage of micro-data of SCI survey which concerns all Italian enterprises with at least 100 employed; further steps will involve integration with other business surveys (PMI, COE, PRODCOM, New economy etc.), and of administrative data (data of Chambers of Commerce and possibly fiscal data). The specificity of this work can be summarised in the following items: - A big amount of data (they occupy various dozens of gigabytes on the disk); - Complete flexibility of the System, it must be free from specific surveys, in order to include step by step present and future information.; - Systematic use of the server of Istat by Diecofis consortium. All this imposes a careful study of the logic structure of the database that, when set up cannot be modified without a radical change of the whole system. In addition to that it imposed a new coding of statistical items in order to unify those items so far codified according to different surveys. 25 / Deliverable 2.1 7 Data model Groups forming the System are shown in the following picture: Statistical Information: ISTAT business surveys Business Register Foreign trade information Annual report and fiscal data Metadata Figure 1 – Data model The structure hub is the enterprise list, which for each enterprise has a record with an internal code (primary key) and codes used in ASIA archive, fiscal code and VAT number. General information and statistical data share dimensional information (workers, employees. turnover). This approach enables to deal in the same way with information provided by ASIA Archive and statistical surveys, as well as by other archives. The modularity of the System allows: - Prompt widening in case of new information (new surveys, new sources) to be added; - Easy and efficient location of information on various computer systems. Information can be filed in various server in order to have a distributed architecture. 25 / Deliverable 2.1 8 The identifying variables The data form was organised considering the following objectives: - To keep the historical memory of changes in the situation of a single unit; - To allow also retroactive changes of specific information, with a new reference date; - To collect information on identity up-dating from a wide range of different sources, keeping the information on the reference source; - To file information on the reference people of enterprises and on specific surveys. Three options were considered in choosing the logic model: - One single table including all information on the enterprise. This solution is little suitable in case of up-dating, and the up-dating of one single field requires to file the historical archive of the whole record; - One single table in which each line contains an information, with an identifying sign enabling to recognise the information and its format. This solution is more flexible: it allows adding all information even-though they were not in the original database. In this way research and management of information is very complicated. In addition to that the physical room they need is about three times the space used by each single field. Each field must have identification and we must consider the space required by its index. Due to the final dimension of the System, the use of space becomes unacceptable; - Few tables with grouped information according to the area. This was the chosen option because of its flexibility and efficiency. Information are grouped as follows: 1. Existence and activity of the enterprise 2. Fiscal and identity information 3. Location information 4. Dimensional information. 25 / Deliverable 2.1 9 These four groups allow flexibility in putting new information as new tables or new fields, efficiency in management, since no space is used to file identification (the only identification is the one of the enterprise). Moreover the grouping of variables was set minimising the space required by historical information: - shift of municipality implies a new address, but not a new enterprise name; - a new enterprise name usually implies a change of fiscal code and VAT number, without any change in location. Each table of the field ―year‖ enables to find immediately the record in force in the analysed year. Dimensional information (turnover, workers, employees) are recorded according to the reference year. In this way we have dimensional variables for each year. 25 / Deliverable 2.1 10 The analysis of the Logical model In order to set the logic structure of the system, we considered three different configurations, for each of them we considered qualities and defects. We then experimented them in order to find out the most efficient configuration. The last phase involved also the development of data navigator, which was used as an instrument of performance analysis. The three analysed configurations can be summarised in the following scheme: Table 1 – Efficiency and flexibility TYPE DESCRIPTION EFFICIENCY FLEXIBILITY Horizontal Tables contain one It requires to read one It is the least flexible tables single record for single record for each solution, for it requires each enterprise and enterprise and for each to change the structure for each year with year, but this record has of tables for any as many fields as many empty fields variation of surveyed the number of variables surveyed variables Completely Each line in a table It is the solution It is the most correct vertical records the amount requiring the highest solution from a relational tables of variables number of accesses to point of view for it leads referred to one the disk to be read: for to a complete year each enterprise and year normalisation of tables: various dozens of This implies a complete records have to be read, independence between but is not said that it is tables and the type of the least efficient for considered survey. only fields containing information are memorised. Vertical Each line of a table Requires a less number Also this solution is very tables with contains the of accesses than the flexible regarding horizontal amount of a previous one, but surveys and their years variable for all imposes to record also content; unfortunately it memorised years fields related with years obliges to establish in 25 / Deliverable 2.1 11 in which values were not advance the number of surveyed. years included in the database. Each line of a table contains the amount of a variable for all the years in which values were not recorded. A test was submitted to analyse the System performance in the three configurations, using data related with SCI surveys 1989-1998. In order to achieve the best performance, DBMS Oracle requires a synchronic ―end‖ of the all System, and this procedure was repeated in the three configurations. Some changes in the physical database structure (i.e. indices structure) or in the System structure (i.e. areas of memory allocated to DBMS) enabled to improve performances of one rank. At the end of this matching we saw that there where no relevant differences of performance. Executing a research by the data browser, we can consider that the server turnaround1 time is of some dozens of milliseconds (the same of some access to disks) time to send the answer on the web ranges from some dozens of milliseconds to (with a 10Mbit LAN no overloaded) to some seconds (for a 64Kbps CDN line) the rest of the time is used by the client station to organise results. Once we have checked that a research operation occupies the server for a negligible time, we decided to adopt the second configuration, the most flexible and most correct from a relation point of view. Tables of information coding were kept separated from the whole System. Those tables are geographical codes (municipality, provinces, regions, geographical areas, foreign countries), typology codes (legal form of enterprises, type of events, etc.) and economic codes (ateco 91, metadata of economic variables, etc.). These tables were memorised in a plan separated from the whole System, for management of these codes deals with a different organising structure from the one of single surveys on enterprises, and also because these codes have to be shared at a reading level by all informative subsystems, in order to have consistency among all information. Following figures represent the logic scheme followed by the physical one: 1 The time between the acceptance of a request by the System and its answer. 25 / Deliverable 2.1 12 Logical Model ENTERPRISE LIFE AND RELATED TABLES: STAFF and TURNOVER ACTIVITY (Sector of activity – CPA - NACE standard) - LEGAL (Legal form) - SITE (Localisation) - EVENTS (Transformations, etc.) - (Years 1996-2000) Produce / Produced Turnover Staff ID (FK) YEAR ID (FK) YEAR s taff Value independents employee ENTERPRISE carried out or carries out ID legal_no Activity had old s ite or has other s ites birth ID (FK) Site death ateco5 (FK) has or had ID (FK) s tatus year UnitID Endure update year Prov (FK) of Legal Mun (FK) ID (FK) pcode Events lform (FK) lform (FK) ateco5 (FK) ID (FK) year eventID (FK) year id_old of other before previous before new Figure 2 – Logical Model: Diecofis Database (Part 1 of 4) 25 / Deliverable 2.1 13 ENTERPRISE NOMENCLATURES AND DEFINITIONS: These tables contain the description of codes LegalForm Countries lform Regions S ource Country Des cr Reg ids ourc e EU lform1 des c r des c ription initials regularity des c r of other P rovinc es of of P rov E ventCode Reg (FK ) V ariables des c r vcode eventID ids ourc e (FK) Des cr year des c r A ctivityClas s c urrenc y atec o5 annotation des c r Municipalities measure_unit atec o3 Mun lett2 P rov (FK ) lett1 pcode population P roduc tID des c r c ipro prodcom_c ode atec o5 (FK ) CN8 des c r Figure 3 – Logical Model: Diecofis Database (Part 2 of 4) des c ription of 25 / Deliverable 2.1 14 Enterprise’s accounts data and direct taxation Corporate year (FK ) Corporate Dataset ID (FK) vcode (FK) des c rip ids ourc e (FK) value Regional Regional T ax and Soc ial Security Contributios year (FK ) ID (FK) des c ription vcode (FK) ids ourc e (FK) value weight P roduc t Datas e P ers onal P ers onal T ax Dataset des c r year (FK ) ID (FK) vcode (FK) ids ourc e (FK) value Figure 4 – Logical Model: Diecofis Database (Part 3 of 4) 25 / Deliverable 2.1 P roduc tID 15 c ipro prodcom_c ode atec o5 (FK ) CN8 Enterprise’s products and other features (ICT etc.) c r des des c ription of des c ription of Other Features of Other year (FK ) des c ripion of ID (FK) vcode (FK) ids ourc e (FK) P roduc t P roduc t Datas et value c ipro (FK) weight ID (FK) des c ription of year (FK ) vcode (FK) ids ourc e (FK) V alue weight Figure 5 – Logical Model: Diecofis Database (Part 4 of 4) 25 / Deliverable 2.1 16 Physical Model ENTERPRISE LIFE AND RELATED TABLES: STAFF and TURNOVER ACTIVITY (Sector of activity – CPA NACE standard) LEGAL (Legal form) - SITE (Localisation) - EVENTS (Transformations, etc.) - (Years 1996-1999) Turnov er Staf f ID : NU MBER(9) YEAR : NU MBER(4) ID : NU MBER(9) YEAR : NU MBER(4) v alue: N UMBER (13) independents : NU MBER(9) dependents: N UMBER(9) ENTERPR ISE ID : NU MBER(9) Sourc e legal_no: VAR CH AR2(11) ids ourc e: VARC HAR 2(9) Act iv ity birth: NU MBER(4) desc ription: VARC HAR2(20) ID : NU MBER(9) Site deat h: N UMBER (4) regularity : VARC HAR 2(2) at ec o5: VARCH AR2(5) UnitI D: N UMBER (6) st at us : NU MBER(4. 4) y ear: N UMBER (4) y ear: N UMBER (4) update: DATE ID : NU MBER(9) Prov : VARCH AR2(3) Legal Mun: VAR CHAR 2(3) ID : NU MBER(9) pc ode: VARC HAR 2(5) Variables Ev ents lty pe: VAR CHAR 2(3) lty pe: VAR CHAR 2(3) v code: VARCH AR2(9) at ec o5: VARCH AR2(5) ID : NU MBER(9) y ear: N UMBER (4) ids ourc e: VARC HAR 2(9) ev ent ID : VAR CH AR2(3) y ear: N UMBER (9) y ear: N UMBER (4) desc r: VARC HAR 2(60) id_old: NU MBER(9) currency : VARC HAR 2(20) annotation: VARC HAR2(50) Figure 6 – Physical Model: Diecofis Database (Part 1 of 4) Product 1 nc 8: VAR CHA at ec o5: VARC desc r: VARC H 25 / Deliverable 2.1 17 ENTERPRISE NOMENCLATURES AND DEFINITIONS: These tables contain the description of codes LegalT ypes NT ERPRISE Countries lform: V ARCHAR2(3) Regions D: NUMBE R(9) S ource c od_paes e: NUMB ER(3) des c r: V ARCHAR2(60) Reg: NUMB ER(2) gal_no: V ARCHAR2(11) ids ourc e: V ARCHAR2(9) lform1: VA RCHAR2(1) UE: VA RCHA R2(1) des c r: V ARCHAR2(40) rth: NUMB ER(4) s igla: V ARCHAR2(2) des c ription: V ARCHA R2(20) eath: NUMBE R(4) des c r: V ARCHAR2(50) regularity: V ARCHAR2(2) atus: NUMB E R(4.4) pdate: DA TE P rovinc es P rov: V ARCHA R2(3) E ventCode Reg: NUMB ER(2) V ariables des c r: V ARCHAR2(40) vcode: VA RCHAR2(9) eventID: V A RCHA R2(3) ids ourc e: V ARCHAR2(9) Des cr: VA RCHAR2(40) year: NUMB ER(9) des c r: V ARCHAR2(60) A ctivityClas s c urrenc y: V ARCHAR2(20) atec o5: VA RCHA R2(5) annotation: VA RCHA R2(50) des c r: V ARCHAR2(40) Municipalities measure_unit: VA RCHA R2(20) atec o3: VA RCHA R2(3) Mun: V A RCHA R2(3) lett2: V ARCHA R2(2) P rov: V ARCHA R2(3) lett1: V ARCHA R2(1) pcode: VA RCHAR2(5) population: NUMB ER(9) P roduc tID des c r: V ARCHAR2(50) c ipro: V ARCHAR2(12) prodcom_c ode: V ARCHA R2(8) atec o5: VA RCHA R2(5) CN8: V ARCHA R2(20) des c r: V ARCHAR2(40) Figure 7 – Physical Model: Diecofis Database (Part 2 of 4) 25 / Deliverable 2.1 18 Enterprise’s accounts data and direct taxation Corporate year: NUMB ER(9) ID: NUMBE R(9) vcode: VA RCHAR2(9) ids ourc e: V ARCHAR2(9) value: NUMBE R(12) Regional year: NUMB ER(9) ID: NUMBE R(9) vcode: VA RCHAR2(9) ids ourc e: V ARCHAR2(9) value: NUMBE R(9) weight: NUMB ER(9) P ers onal year: NUMB ER(9) c odice: NUMB ER(9) vcode: VA RCHAR2(9) ids ourc e: V ARCHAR2(9) value: NUMBE R(12) Figure 8 – Physical Model: Diecofis Database (Part 3 of 4) 25 / Deliverable 2.1 P roduc tID 19 c ipro: V ARCHAR2(12) prodcom_c ode: V ARCHA R2(8) atec o5: VA RCHA R2(5) CN8: V ARCHA R2(20) Enterprise’s products and other features (ICT etc.) des c r: V ARCHAR2(40) Other year: NUMB ER(9) c odice: NUMB ER(9) vcode: VA RCHAR2(9) ids ourc e: V ARCHAR2(9) P roduc t value: NUMBE R(12) c ipro: V ARCHAR2(12) weight: NUMB ER(4.4) ID: NUMBE R(9) year: NUMB ER(4) vcode: VA RCHAR2(9) ids ourc e: V ARCHAR2(9) V alue: NUMBE R(12) weight: NUMB ER(4.4) Figure 9 – Physical Model: Diecofis Database (Part 4 of 4) 25 / Deliverable 2.1 20 Data marts for specific analysis The Multisource database provide the starting point for data warehousing—the legacy application systems. In the legacy applications, transactions are executed in which detailed transaction data is collected. The detailed data that is collected is more of a by-product of transaction processing than anything else. But this raw data forms the very detailed foundation needed for the data warehouse. The transaction data that is collected passes through a layer of programs whose purpose is to integrate and transform the data to make it fit to enter the data warehouse. The layer of programs can be called ―integration and transformation‖ programs or ETL—extract/transform/load programs. These programs can be written manually or in an automated manner by program generators specifically designed to suit the needs of this interface. Once the legacy data passes through the integration and transformation programs, the data is fundamentally changed to meet the informational needs of the microsimulation model. Keys are altered so that there is one group understanding of data, structures are changed, and encoding algorithms are unified. The data enters the layer of integration and transformation programs in an unintegrated state and passes out of the layer of programs in an integrated state. Emanating from the entire statistical information system of enterprises is the collection of data marts. In many ways the data marts appear to be the data warehouse because it is here that the end user has direct interaction with the data warehouse environment. The data marts are almost always oriented to the needs and requirements of the analysis. The data marts are requirements- driven in that they are shaped by the unique and specific needs of the analysis. The data then is processed by specific modules of the model and is ready to serve as a basis for all sorts of decision support systems (DSS) processing, to build statistical studies and indicators for policy analysis. The diecofis integrated database would contain all the information about enterprise: - Identifiers; - Economic variables and balance sheet variables; - Foreign trade activity; - Commercialisation and energetic consumption; 25 / Deliverable 2.1 21 - Innovation features and cost structure. When we shall have all information we may use it for various purposes. So we build local view of data to serve various modules of the Microsimulation model and to build indicators. We could have data-marts oriented to the needs and requirements of the model. Data-marts are requirement-driven (cf. W.H. Inmon) in that they are shaped by the unique and specific needs of every module of the Microsimulation model. Therefore we can compare data-marts with datasets for tax simulating purpose. for specific analysis ODS: Operational Data Store Figure 10 - Today’s mature corporate information factory (cf. W.H. Inmon) Microsimulation modules are built to simulate taxation of enterprises, these are: 25 / Deliverable 2.1 22 - Corporate tax module - Regional (or local) tax and Social Contribution modules - VAT and Excise modules - Personal tax module As a consequence of this structure we need specific sub sets of enterprises with an accurate selection of variables needed So we may have the following subsets of enterprises for every module of the model: - Corporate tax dataset - Regional tax and social security dataset - Products dataset - Personal tax dataset At the final stage, after integration and when tax estimates are produced, granular data are prepared for the construction of indicators. Analysis will be made by: - Percentiles of enterprises (by size or turnover); - Class of economic activities; - Regions or geographical area; - Legal forms etc. 25 / Deliverable 2.1 23 Selected Bibliography Eurostat - ―Use of Administrative Sources for Business Statistic Purposes: Handbook on Good Practices‖ – Theme 4 (Industry, Trade and Services) - 1999 Eurostat Edition Giovannini, E., Sorce, A. – ―Integration of Statistical (survey) data with registers (administrative) data‖ – Meeting on the Management of Statistical Information Technology 2001 – paper. Giovannini, E., Sorce, A. – ―The Data Warehouse, a Modern System for the Dissemination of Information‖ – 2001 DISE-ISTAT paper. Inmon, W. H. – ―Data Marts and Data Warehouse: Information Architecture for the Millenium‖ – Informix Corporation. Sorce, A. – ―The Impact of Data Warehousing on the management of Statistical Offices‖ – New Techniques and Technologies for Statistics 2001 – Exchange of Technology and Know-how – Pre-Proceedings of the Conference (Crete 18-22 June 2001) – Volume II. Sorce, A., Capasso, G., del Mondo, G., Grasso, F., Vignola, L., Fanizzi C. – ―Statistical Information System on Enterprises (SISSI)‖ 1999 DISE-ISTAT paper.