Migrating to Open Source Databases: The Data Warehouse Case Study
Jutta Horstmann (jh@weltraumsofa.de) November 10, 2005
1
Case Study: ImmobilienScout24
ImmobilienScout24 operates the leading German real estate marketplace on the internet, since 1999. Interested parties can look for suitable apartments, houses, land, or vacation cottages and apartments online and view additional information. At 240 million page impressions per month at present, www.immobilienscout24.de is the most visited real estate site in the German-language internet. Its database contains more than 500.000 real estate offers to choose from.1 This case study stops very early in the migration workflow. Analysis of the source system and evaluation of possible Open Source replacements resulted in cancelling the project. As such, this case study is a typical example of when and why not to migrate. Most of the chapter will cover Open Source alternatives for data warehousing. These proved to be not elaborated enough yet for replacing the Immobilienscout setup. Thus I will motivate the decision against migrating as well as discuss incremental migration options.
2
The Data Warehouse: Status Quo
The data warehouse consists of an operational data store (ODS), keeping a consistent snapshot of the operational data, and a decision support system (DSS), providing historical data of the last 7 years. The operational data stems from the primary database system (the web portal’s backend) as well as from flat files
http://www.is24.de/de/ueberuns/presseservice/press\_releases/trade\ _press\_publications.jsp, http://www.is24.de/de/ueberuns/presseservice/press\ _releases/factsheet\_eng.jsp
1
1
of logged events. ODS and DSS run on dual-processor Windows 2000 / Oracle 8i servers with each approx. 250GB of data. The Extraction, Transformation & Load process between the databases is done by Oracle PL/SQL stored procedures, developed in-house. The data warehouse serves management requests for reports on business facts grouped by several dimensions, e.g. time spans, regions or product categories. It provides rudimentary ad-hoc reporting based on manual database excerpts as well as full-grown business intelligence. The latter is supported by one Microstrategy Intelligence server2 , providing metadata for multi-dimensional analysis based on the data in the DSS, using Relational OnLine Analytical Processing. The Microstrategy server runs on a separate dual processor Windows box. For administration, design and development, the data warehouse team uses a combination of Oracle Enterprise Manager3 , Toad4 , ERwin5 and UltraEdit6 on Windows desktops. Microstrategy Architect, Microstrategy Desktop and Microstrategy Web are the business intelligence clients.
3
The Target System
An alternative scenario features an Open Source database management system replacing Oracle, and Open Source tools. I will first proceed to discussing database replacement options.
3.1
Features of a Data Warehouse Database
The term ”Data Warehouse” was coined by William H. Inmon [Inm96]: ”[A] subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process” Generally, any database management system may serve as data repository to a data warehouse system. But it will have to meet other requirements than an OLTP system: • smaller number of (concurrent) users
2 3
http://www.microstrategy.com http://www.oracle.com/technology/products/oem/index.html 4 http://www.toadsoft.com/toad\_oracle.htm 5 http://www3.ca.com/Solutions/Product.asp?ID=260 6 http://www.ultraedit.com/
2
• response time less important (only for interactivity, but the OLAP server may solve this by caching) • read-only access by users • load from operational data store will only insert new records, existing ones do not get changed (updated) • bulk load from operational data store, no single-record inserts (at most once daily) • database access less frequent but executing large and complicated queries that access many rows per table • database design partly de-normalized and redundant for better performance (normalization works fine for transaction processing but is inefficient for OLAP queries due to the excessive amount of joins that will be required to answer business questions.) • database design is data-driven, not workflow-driven • large storage capacity for historical data (fact tables with tens or hundreds of millions of rows and dimension tables with only hundreds or thousands) • may also contain aggregate data, less atomic data • most data access will be targeted at a small partition of the data: the last month or quarter • inconsistent, primarily long- running and complex read-only transactions instead of high constant transaction rate So, one does not have to care that much about locking, ACID transactions and transaction logging. Much more interesting are: - materialized views (containing aggregated data) - table partitioning (to reduce overall I/O contention and improve parallel operations) - SQL:99 features: New GROUP BY operators for easier formulation of OLAP queries: ROLLUP, CUBE, GROUPING SETS, (see [Tue03], p. 204 ff.) - SQL:2003 features: Windowing functions: ROW NUMBER, RANK, DENSE RANK, PERCENT RANK, CUME DIST (see [Kli04], p. 10) - bulk load support - bitmap join indexes (pre-join dimension tables and fact tables and store the result in a single bitmap index) 3
-
performance enhancements for bitmap indexes merge/”upsert” parallelization of queries data aggregation OLAP support efficient full table scans query caching loader speed
3.2
Open Source databases for the Data Warehouse
As Linux systems comprise less than 3% of the OLAP server market [OLAP4], there is also little experience up to now with regard to Open Source data warehouse databases. There are no case studies or other resources to be found mentioning MaxDB or Firebird in the data warehouse context7 . Similar with Ingres, apart from Ingres being used as database backend of the DATAllegro Data Warehouse Appliance [Ingres]. MySQL lacks necessary features, but is present in the data warehouse market nevertheless: O Reilly, Sabre Holdings/Travelocity and Cox Communications are using it for their data warehouses [MySQLO, MySQLS, MySQLC]. Business Intelligence vendors Business Objects and Hyperion (Essbase) integrate MySQL with their products [MySQLB, MySQLE]. There is also a book short of being published on this subject: John Paul Ashenfelter: Data Warehousing with MySQL [Ash05]. Starting spring 2005, PostgreSQL has got its own Data Warehouse distribution called Bizgres, sponsored and co-developed by Greenplum (http://www. bizgres.org, http://www.greenplum.com). Bizgres aims at adapting PostgreSQL to data warehousing needs by changing and adding code as well as integrating related projects. The current Bizgres version is 0.7 (September 05). It includes a complete Open Source data warehouse stack, containing contains an extended PostgreSQL server, ETL (Extraction, Transformation, Load) tools from Kinetic Networks Inc. (http://www.kineticnetworks.com) and JasperReports (http://jasperreports.sourceforge.net/), an open-source reporting engine from JasperSoft Corp. (see more on JasperReports below in section 3.4). In August 2005, a sample application called Bizgres Clickstream was released, showcasing the stack’s use for web server data mining8 .
Except from one academic paper on developing an open-sourced data warehouse builder based on Borland InterBase 6 Open Edition Database Server [LS03] 8 http://www.bizgres.org/assets/docs/clickstream/html/TOC.htm
7
4
The feature list includes9 (most available in 0.7, some planned for v.1.0): Table partitioning Bitmap indexes, bitmap scan Enhancement of loader speed for CSV and text data OLAP support (rank, cube, dense rank) Performance enhancements for bulk loading bitmap indexing, temporary table creation Temporary file I/O improvement Statistics fix for optimizer Replacement of qsort() with an internal sort I/O pre-fetch enhancements Inclusion of Slony (replication) Inclusion of PgAdmin (admin tool) Inclusion of Jaspersoft (reporting toool) Improvement of client side driver support (JDBC, ODBC) BI Tools integration (Microstrategy) BI Tools integration (Business Objects) BI Tools integration (Informatica) interactive GUI installer and interactive configurator
3.3
Open Source Business Intelligence
Business Intelligence (BI) can be described as the process of enhancing data into information and then into knowledge [GRC04]. Or, more elaborate: ”Business intelligence is the process of gathering and analysing internal and external business information. In addition, business intelligence is defined as the process which supports operational and tactical business decision-making.” [OPHL02] A business intelligence architecture uncludes a tool for providing multi-dimensional reports. This software relies on data provided by an OLAP server. This server may be the primary business intelligence database (multi-dimensional OLAP: MOLAP), using a specific dimensional database engine. Or there are distinct OLAP tools (as part of the business intelligence suite itself or as an extra server) accessing a relational DBMS (ROLAP).
9
http://lwn.net/Articles/142076/, http://www.bizgres.org/articles/?id=3
5
There are two business intelligence vendors providing support for an Open Source database as part of a ROLAP setup: Business Objects10 and Hyperion Essbase11 both support MySQL. Nevertheless it is possible to connect any business intelligence tool to Open Source databases using ODBC or JDBC (but you will not get full functionality of the software). An OLAP server specifically designed for MySQL is provided by EBM Software: OLAP4All features an open Java API, but is no Open Source software (http://www.olap4all.com). There is yet no such thing as an enterprise level Open Source business intelligence suite. The year 2005 nevertheless saw the emergence of several projects emerging or gaining strength: BIRT (Business Intelligence and Reporting Tools) is a project of the Eclipse Foundation and Actuate Corp. (http://www.eclipse.org/birt/, www.actuate.com/birt/). It is currently focussed on providing embedded reporting for J2EE applications. BIRT’s short- and long-term goals: 1. provide Eclipse plugins for design, deployment, creation and presentation of reports 2. OLAP and Business Intelligence dashboard functionality 3. support Executive Information Systems (EIS), statistical analysis, whatif analysis, Data Mining, Data Warehouse modelling, provide ETL tools, Data Quality tools BIRT is tested against MySQL, but supports any RDMBS that provides JDBC. Mondrian is an OLAP server written in Java (http://mondrian.sourceforge. net/). It executes MDX queries, reads data from a relational database and presents the results multi-dimensionally. Mondrian supports IBM DB2, Oracle, MS Access, MS SQL Server, MySQL, PostgreSQL, as well as any other relational database management systems by JDBC. The report presentation is implemented in JSP by the JPivot library (http://jpivot.sourceforge.net/). It provides OLAP table and their navigation (slice & dice, drill down, roll up). OpenI by Loyalty Matrix, Inc. is a J2EE web application running on top of tomcat (http://openi.sourceforge.net/, http://www.loyaltymatrix.com) . It publishes web-based analytical reports from three types of data sources: OLAP servers, relational database servers, and data mining servers. Pentaho is a very fresh start-up that aims at providing a ROLAP server setup with web frontend and reporting/analytical functionality (http://www.pentaho.
http://www.mysql.com/news-and-events/press-release/release\_2005\_10.html http://dev.hyperion.com/resource\_library/support\_matrixes/deployment\ _services/eds\_72\_faq.cfm\#l
11 10
6
org). Therefore it offers a complete Open Source business intelligence stack including Eclipse BIRT, JasperReports, Mondrian, and JPivot. Database integration is based on JDBC, namely supporting Firebird, DB2, MS SQL, MySQL, Oracle, PostgreSQL. Palo is an Open Source MOLAP server under development by the German company Jedox (http://www.opensourceolap.org/, http://www.jedox.com/). A first release is expected in fall 2005. The BEE Project lacks the interoperability of the aforementioned applications (http://bee.insightstrategy.cz). It is written in Perl and C, supports only MySQL and runs only on Linux/UNIX platforms. Furthermore it is only scalable up to 50GB of data in the MySQL repository. BEE provides an ETL Tool, ROLAP Server and Client.
3.4
Open Source Reporting
Besides advanced multi-dimensional analysis, there are also requests for low-level, two-dimensional reporting based on the Operational Data Store at Immobilienscout24. Nowadays this is provided by ad hoc SQL queries, resulting in simple Excel sheets. To address the need for automated, professional reporting, there is a choice of several Open Source reporting tools: Agata is a cross-platform database reporting tool with graph generation (http: //www.agata.org.br). It reads data from various relational database management systems and exports reports to PostScript, plain text, HTML, XML, PDF, CSV, OpenOffice Writer. Agata features a command line and a web interface. BIRT was already mentioned above. The current version provides reporting functionality based on the Eclipse framework. DataVision compares itself to the Crystal Reports software (http://datavision. sourceforge.net). It offers report design via a drag & drop GUI or directly in an XML editor. Reports may be exported to HTML, XML, PDF, LaTeX2e, DocBook, CSV. DataVision is written in Java and connects to any database providing JDBC as well as to text files. JasperReports, available Open Source for some years already, has been taken over by the company JasperSoft in 2004 (http://jasperreports.sourceforge. net). They hold a business partnership with MySQL, while PostgreSQL includes JasperReports in its Bizgres distribution. JasperReports is a platform independent Java application that produces reports as PDF, HTML, XLS, CSV and XML. Report designs are written as XML templates (no GUI). OpenRPT was part of OpenMFG’s ERP package and got open-sourced in April 2005 (http://www.openrpt.com/). It compares itself to Crystal Reports and
7
Microsoft Access Report writer. OpenRPT features report definitions in XML, stand-alone or embeddable WYSIWYG report designer, report rendering to PDF and Postscript.
4
4.1
Conclusion
The Case against the Open Warehouse
This case study serves as an example for rejecting the migration plan after analyzing the source system and evaluating target system alternatives. Immobilienscout decided against migrating based on the following reasons: • There are only references for Open Source data warehouses based on MySQL, which does lack lots of relational database features (like stored procedures and triggers) that the Immobilienscout Oracle warehouse relies heavily on. • There are emerging options for building a data warehouse based on PostgreSQL/Bizgres, but no case studies yet and the Bizgres software is still in beta. • There is very little independent software vendor support for Open Source databases in the business intelligence sector. As Microstrategy does not support any of them, a new BI tool would have to be acquired. The available Open Source business intelligence tools were either too small-featured or only in alpha or beta phase yet. • Only one of the evaluated Open Source database management systems provides database links: PostgreSQL, allowing for example a procedure running in one schema to access the data in another schema. The current data warehouse setup makes plenty use of Oracle DBlinks. But, as said above, the PostgreSQL/Bizgres stack is still in beta.
4.2
Options
The decision at Immobilienscout was not against Open Source software per se. For example, they do already employ Apache and Tomcat servers running on Linux machines. As the data warehouse setup is lacking an ad-hoc reporting tool at the moment, a decision pro JasperReports was quickly reached. This is much easier to implement, as it affects the introduction of a new software instead of migrating current tools. This could be the first step towards an incremental migration of the data 8
warehouse software towards Open Source alternatives. If the administrators gain experience with JasperReports, introducing a larger Open Source Business Intelligence stack entailing Jasper, like Pentaho or Bizgres, could be a possibility in the future.
9
References
[Ash05] John Paul Ashenfelter. Data Warehousing with MySQL, chapter 2. APress, 2005. WWW: http://www.ashenfelter.com/content/ oscon2004_dwmysql_ch02.doc (last checked 09/23/05). Matteo Golfarelli, Stefano Rizzi and Juris Cella. Beyond Data Warehousing: What’s Next in Business Intelligence? In: DOLAP’04, November 12/13, 2004, Washington, DC, USA., 2004. WWW: http://www.ececs.uc.edu/~dolap04/DOLAPdocs/papers/ golfarelli.pdf (last checked 09/24/05). Ingres and Open Source - a success story. WWW: http://www. channelregister.co.uk/2005/03/24/ingres_and_open_source/ (last checked 09/24/2005). William H. Inmon. Building the Data Warehouse. Wiley, 2. Edition, 1996. Kevin E. Kline. SQL in a nutshell. O’Reilly, 2. Edition, 2004. Maurice Ling and Chi Wai So. Architecture of an Open-Sourced, Extensible Data Warehouse Builder: InterBase 6 Data Warehouse Builder (IB-DWB). In: Proceedings of the Inaugural Australian Undergraduate Students’ Computing Conference, 2003. WWW:
[GRC04]
[Ingres]
[Inm96] [Kli04] [LS03]
[MySQLB] MySQL and Business Objects Announce Partnership. http://businessobjects.com/news/press/press2005/ 20050418_mysql_part.asp (last checked 08.05.2005).
[MySQLC] Cox Communications Powers Massive Data Warehouse with MySQL. WWW: http://www.mysql.com/it-resources/case-studies/ mysql-cox-casestudy.pdf (last checked 08.05.2005). [MySQLE] Essbase Integration Services Release 7.1 Support Matrix. WWW: http://dev.hyperion.com/products/hub/hub_721_matrix.pdf (last checked 09/24/2005). [MySQLO] Roger Magoulas: Building the Open Warehouse (MySQL Conference 2005). WWW: http://conferences.oreillynet. com/presentations/mysql05/magoulas_roger.%pdf (last checked 08.05.2005). [MySQLS] Golden’s Rules: Data warehousing at 75% off with open source, part 2. WWW: http://searchenterpriselinux.techtarget.com/ columnItem/0,294698,sid39_g%ci1024648,00.html (last checked 09/24/2005). 10
[OLAP4]
The OLAP Survey 4. WWW: http://www.olapreport.com/ survey.htm (last checked 09/27/2005).
[OPHL02] J. Okkonen, V. Pirttim¨ki, M. Hannula and A. L¨nnqvist. Triangle a o of Business Intelligence, Performance Measurement and Knowledge Management. In: IInd Annual Conference on Innovative Research in Management, May 9 - 11, Stockholm, Sweden., 2002. [Tue03] Can Tuerker. SQL:1999 & SQL:2003 - Objektrelationales SQL, SQLJ & SQL/XML. dpunkt.verlag, 1. Edition, 2003.
11