Software Requirements
Requirements for Actuarial Model Outcome Optimal Fit (AMOOF)

Team Leader: Daniel Hanson
Lead Developer: Matthew Miller
Q.A. Manager: Jon Swenson
Lead Tester: Edward Badgley

Introduction

Current statistical software packages on the market can often fit data to standard probability distributions, but none we have encountered include the mixed distributions commonly used in the actuarial field. The software solution we will create, Actuarial Model Outcome Optimal Fit (AMOOF), will find the best fit standard probability distribution and then determine whether a mixed distribution is needed; if one is required, the software will find the best fit mixed distribution.

The idea for this project comes from a paper by our client, Dr. Yvonne Chueh, which can be found at: http://www.actuaries.ca/meetings/stochasticsymposium/Papers/Yvonne%20Chueh.pdf. Dr. Chueh is a Professor in the Mathematics department of Central Washington University. Actuarial Science and Insurance Modeling, areas in which Dr. Chueh specializes, can benefit from software that can fit data to mixed distributions. Dr. Chueh would like to be able to market this software, in which case the program may be used by anyone who practices or researches Actuarial Science, in addition to our client.

This document covers the requirements of the proposed software solution to this problem. It is for the use of the client, developers, project managers and testers. The sections that follow are:
Project Overview: covers the factors dictating this project.
Development, Operation and Maintenance Environments: details the hardware and software environments required for these three phases of the program.
System Model: describes the major components of the system.
User Interaction: covers the user's view of the proposed program's capabilities.
Functional Requirements: explains the necessary abilities of the program.
Nonfunctional Requirements: describes restrictions on the program not related to the specific functions of the system.
Feasibility: describes the probability of completing this project with our resources, the minimum system that fulfills our requirements, and the system we would prefer to develop.
Appendices: contains all diagrams referred to in this document as well as a glossary of lesser known terms.

Project Overview

The purpose of AMOOF is to find a parametric probability model that best describes a set of data. This model, or probability distribution, can then be used to make predictions based on the data. The model must be as accurate as possible because, for example, companies may use this information to decide the allocation of resources.

To describe a data set, AMOOF will use a probability distribution. When AMOOF calculates the best fit distribution, it will first look for a best fit standard probability distribution, or probability density function (PDF). AMOOF will then determine whether a mixed distribution is needed by comparing the statistics of the data to the parameters of the best fit PDF. When a mixed distribution is determined to be necessary, AMOOF will find the best fit distribution formed from the weighted average of two PDFs.

It is not feasible to find optimal probability distributions without software aid, as data sets can include thousands, or even millions, of points. There are software packages on the market that fit data to standard distributions, but not to the mixed distributions that describe the data more accurately. Currently an actuary could use a spreadsheet program to optimize each probability distribution they want to test for a data set, and then determine the best fit distribution based on their spreadsheets. This process would require the creation of a new spreadsheet for each distribution and manual entry of each equation.
AMOOF will reduce the time and knowledge needed to make predictions based on a given data set. AMOOF will allow the user to find the optimal model of their data through a straightforward interface. The user will be able to enter data into a spreadsheet or import data from a file. Once the data has been entered, the user will be able to find the best fit probability distribution with the click of a button. After an optimal fit distribution has been found, a report will be generated comparing the statistics of the original data set to the parameters of the distribution function. AMOOF will also allow the user to view the data and the distribution graphically, with multiple graphing options available.

During the development of AMOOF, there are several important constraints to consider. AMOOF must be accurate and robust because of the importance of the decisions that may be made based on this program's output. The resources allotted to the development of the software are a limiting factor as well: the development team, composed of four CWU students, has less than three months to work on it in addition to their studies.

Operation, Development, and Maintenance Environments

There are a few constraints on the operation environment that will be used by AMOOF. The program's target platform is the Microsoft Windows 2000 or XP operating system. In addition, AMOOF will require Microsoft Excel and the .NET Framework to execute. The computer must be an IBM-compatible PC with a 1 GHz processor and 256 MB of RAM. Users will therefore need a current version of Windows running on a newer computer.

Our package will interact with the Windows operating system in two main ways: through the Windows clipboard and through the Windows file system. These interactions will allow the user to cut and paste from the clipboard and to save and open data from files.
Developing and maintaining AMOOF will require a computer running Microsoft Windows XP. Microsoft Visual Studio .NET and Microsoft Excel will be used in developing the software package. The maintenance of our package will require the same environment as development. Therefore, the development and maintenance environments will only require Visual Studio .NET and Windows XP on a computer that meets the hardware requirements for running both.

System Model

Two components of the System Model are calculations and plotting. The calculations are: data statistics, the Maximum Likelihood Estimate (MLE) for each of the 22 probability density functions (PDFs), and determination of the best fit distribution. The plots are the histogram of the data, the curve of the best fit probability distribution, and the histogram together with the best fit distribution function.

Calculating the data statistics is straightforward. The statistics to be calculated are the mean, the standard deviation, and six other moments of the data. We use these statistics when calculating the MLE.

Calculating the MLE, on the other hand, is more difficult. The statistics are used as a starting point, and from there an optimization is performed to maximize the output of the likelihood function. The program does these calculations for each of the 22 PDFs and, if required, for each of the mixed distributions.

After the MLE calculations on each of the probability distributions, we compare the moments of each distribution to the moments of the data to find the relative error in the moments. The best fit probability distribution is the distribution with the least relative error. The Data Flow Diagram in Appendix C describes the flow of data for calculating the best fit distribution.

Creating the plots should not be overly difficult. The histogram will be based on the data set, and the best fit probability distribution plot will be based on the best fit distribution.
Once we program the individual plots, putting them together should be an easy process.

There are three ways to enter data. The first is to enter data directly into the spreadsheet. The second is to cut and paste the data into the spreadsheet. The last is to import the data from a tab-delimited file.

Most users would like to save their calculations for future use. Our save workspace feature allows the user to save their work. If the calculations have been finished and the best fit probability distribution chosen, then these settings will be saved to the workspace as well.

Functional Requirements

The functional requirements are the functionality or services that the system provides. The functional requirements of this project are:
Run Calculations
Request Plots
Enter Data
Edit Options
Save Workspace
Open Workspace

Run Calculations

The run calculations requirement is the most involved of the requirements. There are two sets of calculations to be computed: calculation of the data statistics and calculation of the best fit distribution.

The data statistic calculations will be simple to develop. The statistics calculated for the data set input by the user are the average, median, variance, and moments.

The average is calculated by summing the data points and dividing by the number of data points. In mathematical notation the average is

\[ m_1 = \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}. \]

Closely related to the average is the median. The median is more a choice of a data point than a calculation. To find the median, the data set is put in order and the middle data point is selected. If the number of elements is even, then the average of the two middle data points is used as the median. In a normal distribution, the average is equal to the median.

Variance measures the spread of the data. A sum is produced by squaring the difference between each data point and the mean and adding the squares.
This sum divided by the number of elements gives the average of the squared differences. The mathematical equation is

\[ m_2 = \sigma^2 = \frac{\sum_{i=1}^{n} (\bar{X} - x_i)^2}{n}. \]

The moments have a form similar to the variance. The difference is that the number of the moment replaces 2 as the power applied to each difference. Therefore, the equation for the k-th moment is

\[ m_k = \frac{\sum_{i=1}^{n} (\bar{X} - x_i)^k}{n}. \]

AMOOF will use the above data statistic calculations in the more challenging calculations. The more complicated calculations are finding the best fit probability distribution using the likelihood functions for each of the 22 PDFs and the mixed distributions, the conditional tail expectation, and the KS tests.

The likelihood function is the product, over all data points, of the PDF evaluated at each data point; it measures how probable it is that the data came from the given distribution. The likelihood function returns a number based on the PDF, the data points, and the parameters of the distribution function, denoted as theta (Θ). Running an optimization on Θ to maximize the likelihood function will find the best fit of the current PDF. The moments of the data are used to estimate the initial values of the parameters. The maximized likelihood function will be calculated for each of the 22 PDFs in turn.

When the best fit PDF has been determined, AMOOF will decide whether a mixed distribution is required. The program will find the relative error between the first four positive and negative moments of the best fit distribution and those of the data set. AMOOF will also run a KS test and a Chi Square test to determine the accuracy of the PDF. The PDF is considered good enough if the relative error and the results of the tests are within the bounds set by the user. Default bounds will exist, but have yet to be determined.

If the best fit PDF is not good enough, AMOOF will move on to find the best fit mixed distribution. Mixed distributions are distributions that use more than one PDF to describe a distribution.
AMOOF will only consider weighted averages of two PDFs of the same class for mixed distributions. The technique for finding the best mixed distribution is much like the process for finding the best PDF, but there are twice as many parameters to optimize, as well as the weights for each PDF. There are 22 available mixed distributions to consider.

When the calculations are finished, AMOOF will produce a report. The report will include the statistics of the data, the best fit distribution, and its parameters. In addition, the report will include the conditional tail expectation (CTE), which gives statistics of the distribution one would get by considering only the data points a predetermined distance from the mean (the distance from the mean will be determined by the user).

Request Plots

Three different plots will be available: the histogram, the best fit probability distribution, and a combination showing the histogram with the best fit PDF. The histogram plot displays the data set. The best fit probability distribution plot displays the curve that is fitted to the data set. The combination plot displays both, which allows the user to see how closely the distribution fits the data set.

Enter Data

There will be three methods of entering data: data entry, cut and paste, and importation from a tab-delimited file.

Data entry is the simplest entry method. It is done with a standard spreadsheet that has two columns, one for the values and one for the frequency of each value.

Cut and paste options will be available to move data to and from the Windows clipboard. This function will work the same as cut and paste in most other programs.

The program will read data from a file. Data within the file will be tab-delimited, meaning it will have a tab between each of the data points.
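A minimal sketch of the tab-delimited import, written in Python for brevity (the project itself targets Visual Studio .NET). The file layout is an assumption, since the requirements do not fix it: one record per line, holding a value and its frequency separated by a tab, mirroring the two spreadsheet columns. The names read_value_frequency and weighted_mean are hypothetical.

```python
import csv
import io


def read_value_frequency(text):
    """Parse tab-delimited text into (value, frequency) pairs.

    Assumed layout (not specified in the requirements): one record per
    line, value<TAB>frequency.
    """
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, fields in enumerate(reader, start=1):
        if not any(field.strip() for field in fields):
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(fields)}")
        try:
            value = float(fields[0])
            frequency = int(fields[1])
        except ValueError:
            # Corresponds to the "non-numeric data" error condition.
            raise ValueError(f"line {line_no}: invalid number") from None
        if frequency < 0:
            raise ValueError(f"line {line_no}: negative frequency")
        rows.append((value, frequency))
    return rows


def weighted_mean(rows):
    """Average of the values, each counted `frequency` times."""
    total = sum(frequency for _, frequency in rows)
    return sum(value * frequency for value, frequency in rows) / total
```

Rejecting malformed rows with the line number attached is one way to surface the "file format is incorrect" and "non-numeric data" error conditions listed in the use cases; a real importer might instead collect all errors before reporting them.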
Edit Options

Many options will be available; however, we do not have an exhaustive list of them at this time. One option will allow the user to specify the x range of the graph. It will be possible to choose to search for only singular or only mixed distributions. In addition, the user will have the ability to change the parameters that determine whether a distribution is good enough.

Save Workspace

Save workspace will save the data, statistics, and best fit probability distribution to a file.

Open Workspace

Open workspace will load data into the spreadsheet, initialize variables with the statistics, and load the selected best fit probability distribution.

Nonfunctional Requirements

Nonfunctional requirements are those that affect the development of AMOOF but are not features provided in the program. The nonfunctional requirements of AMOOF are:
Accuracy
Data Sets
Resource Constraints
User Interface

Accuracy

AMOOF must be accurate and robust because of the importance of the decisions that could be made based on this program's output. For example, if a company bases its rates on a distribution function and there was an error in the calculation of that distribution, there may be financial repercussions. Thus, accuracy is of utmost importance to AMOOF. To ensure accuracy, the development and testing of AMOOF must strictly follow the guidelines of our Quality Assurance plan.

Data Sets

AMOOF should handle the large data sets common in the actuarial field. A company may have more than a million data points to work with. While a random sample of data points will give a good approximation of the whole data set, the more points that can be included in the calculations, the better the distribution function will approximate the data. The feasibility of working with a million data points has not yet been tested.
Therefore, the minimum number of data points AMOOF will work with is ten thousand, and if resources allow this number will be raised to one hundred thousand.

Resource Constraints

Resources for the development of AMOOF are in short supply. The deadline for delivery of the final project is in mid-March, approximately three months from the creation of this document. As students, the members of the development team for AMOOF have a limited amount of time to devote to the project each week. This creates a tight schedule under the best of conditions.

User Interface

The target users of AMOOF are those in the Actuarial Science field who frequently use standard office applications. To make AMOOF accessible to its intended users, it must mimic an office application whenever possible. To accomplish this goal, the user interface will be developed in Visual Studio .NET and standard Windows functionality will be used for all features. Standard Windows functionality includes tool tips, a menu bar and toolbars, among others.

Feasibility

The deadline to have AMOOF delivered by the end of winter quarter puts constraints on our development process. The need to have the program completed in three months means that a minimum version of the system must be designed and then expanded upon as time allows. This section details the minimum version of AMOOF, which delivers only the essential features, followed by the planned version of AMOOF that the development team hopes to deliver. In addition, an ideal version of AMOOF with everything that has been suggested is detailed, though it is highly unlikely there will be time to develop it.

Minimum System

This minimum version of AMOOF provides only the essential features for accomplishing the main goal of the program: to fit data to mixed distributions. The development team is confident it can produce this version of AMOOF at the very least. The features are:
Data can be imported from a tab-delimited file.
Manual entry will be available via a spreadsheet.
The maximum size of a data set will be ten thousand data points.
A summary of the statistics of the data set will be available.
Finding the best fit distribution:
  The best fit PDF will be determined using the MLE and MME techniques.
  The program will decide if a mixed distribution is required based on the CTE, the KS Test and the relative error of the moments of the best fit PDF compared to those of the data.
  The best fit mixed distribution of a weighted average of two PDFs of the same class will be determined.
A report will give the statistics of the data as well as the parameters of the best fit distribution and details on the goodness of fit.
The report can be copied for pasting into other applications.
Data and all calculated statistics can be saved to, and opened from, a file.
Three graphs will be available:
  A histogram of the data.
  The best fit distribution curve.
  A combination of the histogram and distribution curve.
Options will allow the user to:
  Specify the x range of the graph.
  Choose to look for only singular or only mixed distributions.
  Change the parameters that determine if a standard distribution is good enough.
The program will run on Microsoft Windows 2000 and XP with Microsoft Excel and the .NET Framework installed.
The Help menu links to the User Manual.

Planned System

This is an upgrade from the minimum system version of AMOOF, including the nonessential features that are of the highest priority. The development team plans to deliver this version of the system. The additional features are:
The ability to copy graphs.
The ability to print the report, graphs and data sets.
The maximum size of the data set will be fifty thousand data points.
The ability to paste data from Excel.
The program will run on Microsoft Windows 98 or later with Microsoft Excel and the .NET Framework installed.
A Help menu that links to an online searchable reference library including the User Manual.

Ideal System

This is a further upgrade from the planned system version of AMOOF, including features that are considered unnecessary but would be nice to have. Features from the ideal system are not expected to make it into the delivered system. The additional features are:
Trial capability, with the trial available online.
The ability to find mixed distributions using a weighted average of three or more PDFs.
More columns for multiple data sets.
The ability to find mixed distributions by partitioning the data set.
Statistics for a segment of the distribution.
The program will run on Microsoft Windows 95 or later.
The program will run on a system with 64 MB of RAM or more.
The maximum size of the data set will be one hundred thousand data points.
An option to show Chi Square output in the report.
The ability to import from Excel files.

Appendix A
Glossary

Actuarial Science: The science of probability and statistics applied to insurance and company modeling; the science of evaluating the likelihood of future contingencies.
AMOOF: Actuarial Model Outcome Optimal Fit; the program this document is written for.
Conditional Tail Expectation: Distribution statistics based on the tail alone.
CTE: Conditional tail expectation.
KS Test: The Kolmogorov-Smirnov test; tests whether the data set comes from a given probability distribution.
Likelihood Function: Determines the probability that a data set came from a particular distribution.
Maximum Likelihood Estimate: Technique for finding an optimal fit probability distribution for a data set.
Method of Moments Estimate: Technique to estimate the parameters of a distribution; used in this project to determine initial values of parameters for optimization.
Mixed Distribution: A distribution including more than one PDF; for the purposes of this project, mixed distributions are weighted averages of two PDFs from the same class.
MLE: Maximum likelihood estimate.
MME: Method of moments estimate.
Model: Refers to a model of the data, or a distribution function.
.NET Framework: Software required to run programs developed in Visual Studio .NET; a free update for Windows 98, 2000 and XP.
PDF: Probability density function.
Probability Density Function: Describes continuous probability distributions.
Probability Distribution: Gives the relative frequencies of all the possible outcomes of the random variable.

Appendix B
Use-Cases

VISIO USE CASE DIAGRAM GOES HERE.

Use Case Name: Start Program
Actor: Actuary
Summary: The actuary requests to run the program.
Pre-Conditions:
  Program is installed on the computer
Normal Flow of Events:
  1. User clicks on program icon
  2. Program displays splash screen
  3. Program loads
Error Conditions:
  1a. User does not have access to run program
  3a. Not enough memory
Concurrent Activities:
  1. Program displays splash screen
  2. Program loads
Post-Conditions: Program is loaded and ready to be used
Author: Daniel Hanson and Matthew Miller

Use Case Name: Import Data
Actor: Actuary
Summary: The actuary imports the data by selecting the import option from the file menu.
Pre-Conditions:
  1. User has data in a tab-delimited file
  2. User knows location of the file
  3. Program is running
Normal Flow of Events:
  1. User requests importation of data
  2. User browses to folder containing data file
  3. User selects data file
  4. Verify data file format
  5. Load data into spreadsheet
Error Conditions:
  2a. Location is inaccessible
  3a. Requested file does not exist
  3b. Error reading file
  4a. File format is incorrect
  5a. Data set too large for program
  5b. Individual numbers too large
Concurrent Activities: None
Post-Conditions: Data is loaded into workspace
Author: Daniel Hanson and Matthew Miller

Use Case Name: Enter Data
Actor: Actuary
Summary: The actuary enters the data by typing or pasting it into the spreadsheet.
Pre-Conditions:
  1. User has data
  2. Program is running
Normal Flow of Events:
  1. Set focus to spreadsheet
  2. User enters or pastes data
Error Conditions:
  1a. Spreadsheet not available
  2a. User enters non-numeric data
  2b.
Data in clipboard is formatted incorrectly
Concurrent Activities: None
Post-Conditions: Data is entered into workspace
Author: Daniel Hanson and Matthew Miller

Use Case Name: Open Workspace
Actor: Actuary
Summary: The actuary requests to open an existing workspace.
Pre-Conditions:
  1. Program is running
  2. A workspace file already exists
  3. User knows location of file
  4. Not performing calculations
Normal Flow of Events:
  1. User requests to open an existing workspace
  2. User browses to location
  3. User selects file
  4. User initializes open
  5. Program reads in data
  6. Program initializes variables and sets up spreadsheet
Error Conditions:
  2a. Location inaccessible
  4a. File access errors
  5a. File not correct file format
  5b. Data corrupt
Concurrent Activities: None
Post-Conditions: Data and statistics are loaded into the program
Author: Daniel Hanson and Matthew Miller

Use Case Name: Edit Options
Actor: Actuary
Summary: The actuary requests to edit program options.
Pre-Conditions:
  1. Program is running
Normal Flow of Events:
  1. User requests to edit options
  2. Program displays options menu
  3. User selects and changes option(s)
  4. Program sets option(s) as specified
Error Conditions:
  3a. Conflicting options
Concurrent Activities: None
Post-Conditions: Program option(s) set
Author: Daniel Hanson and Matthew Miller

Use Case Name: Get Data Statistics
Actor: Actuary
Summary: The actuary requests data statistics.
Pre-Conditions:
  1. Data has been input
  2. Program is running
Normal Flow of Events:
  2. User requests data statistics
  3. Program calculates statistics
  4. Program opens a new window
  5. Program displays statistics
Error Conditions:
  2a. Probabilities do not add up to 1
  2b. Truncation error
Concurrent Activities: None
Post-Conditions:
  1. Statistics calculated
  2. Statistics displayed in new window
Author: Daniel Hanson and Matthew Miller

Use Case Name: Run Calculations
Actor: Actuary
Summary: The actuary selects start calculations from the file menu.
Pre-Conditions:
  1. Data has been input
  2.
Program is running
Normal Flow of Events:
  1. User indicates to run calculations
  2. Program validates data
  3. Program calculates statistics
  4. Program calculates a Maximum Likelihood Estimate (MLE) for each of the 22 PDFs
  5. Program determines if mixed distributions are going to be used
  6. Program finds MLE for mixed distributions if needed
  7. Program decides best PDF or mixed distribution
  8. Program displays distribution statistics
Error Conditions:
  2a. Sum of probabilities does not equal one
  3a. Truncation error
  4a. Truncation error
  4b. Numbers get too large
  6a. Truncation error
  6b. Numbers get too large
Concurrent Activities: None
Post-Conditions:
  1. A best fit PDF
  2. Distribution statistics
Author: Daniel Hanson and Matthew Miller

Use Case Name: Request Plots
Actor: Actuary
Summary: The actuary requests to see the histogram and/or best fit plot.
Pre-Conditions:
  1. Program is running
  2. Not performing calculations
Normal Flow of Events:
  1. User requests a plot of the data
  2. Program prompts for plot type(s)
  3. User enters plot type(s)
  4. Program creates a new window
  5. Program displays histogram and/or plot
Error Conditions:
  3a. No data in spreadsheet
  3b. User selects unavailable plot type
Concurrent Activities: None
Post-Conditions: A graphical display of the histogram and best fit plot
Author: Daniel Hanson and Matthew Miller

Use Case Name: Save Workspace As
Actor: Actuary
Summary: The actuary requests to save workspace as another file or file type.
Pre-Conditions:
  1. Program is running
  2. Not performing calculations
Normal Flow of Events:
  1. User requests Save As
  2. User browses to location
  3. User enters new filename
  4. User initializes save
  5. Program saves data, calculations, options and window locations to file
Error Conditions:
  2a. Location is inaccessible
  3a. User enters incorrect filename
  4a.
Filename in use
Concurrent Activities: None
Post-Conditions: Workspace is saved to a file
Author: Daniel Hanson and Matthew Miller

Use Case Name: Save Workspace
Actor: Actuary
Summary: The actuary requests to save workspace to a file.
Pre-Conditions:
  1. Program is running
  2. Not performing calculations
Normal Flow of Events:
  1. User requests Save
  2. If program not previously saved, calls Save Workspace As
  3. Program saves data, calculations, options and window locations to file
Error Conditions:
  2a. Location is inaccessible
Concurrent Activities: None
Post-Conditions: Workspace is saved to a file
Author: Daniel Hanson and Matthew Miller

Use Case Name: End Program
Actor: Actuary
Summary: The actuary requests to exit the program.
Pre-Conditions:
  1. Program is running
Normal Flow of Events:
  1. User requests to exit program
  2. Program checks to see if the workspace has been saved
  3. Program prompts user to save if applicable
  4. Program calls Save Workspace if applicable
  5. Program closes
Error Conditions: None
Concurrent Activities: None
Post-Conditions: Program is closed
Author: Daniel Hanson and Matthew Miller

Appendix C
Data Flow Diagram of Calculations

DATA FLOW DIAGRAM GOES HERE
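
The calculation flow described in the System Model section (data statistics, then a maximum likelihood fit started from a moment-based estimate, then the relative error of the moments) can be sketched for a single PDF as follows. This is an illustration under stated assumptions, not the project's design: it is written in Python rather than .NET, it fits only the exponential distribution, and it uses a golden-section search because the requirements do not name an optimization method. All function names are hypothetical.

```python
import math


def moments(data, k_max=4):
    """Return [m1, ..., m_k_max]: the mean, then central moments for k >= 2."""
    n = len(data)
    mean = sum(data) / n
    # The (mean - x) sign convention follows the moment formulas in the
    # Run Calculations requirement.
    return [mean] + [sum((mean - x) ** k for x in data) / n
                     for k in range(2, k_max + 1)]


def exponential_log_likelihood(rate, data):
    """Log of the product of rate * exp(-rate * x) over all data points."""
    return len(data) * math.log(rate) - rate * sum(data)


def maximize(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximum of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return (lo + hi) / 2


def fit_exponential(data):
    """MLE of the exponential rate, started from the moment estimate 1/mean."""
    guess = 1.0 / moments(data)[0]
    # Assumed search bracket: one decade either side of the starting guess.
    return maximize(lambda r: exponential_log_likelihood(r, data),
                    guess / 10, guess * 10)


def relative_errors(data, rate):
    """Relative error of the fitted exponential's moments against the data's."""
    # Exponential moments in the same (mean - x) convention as moments().
    model = [1 / rate, 1 / rate**2, -2 / rate**3, 9 / rate**4]
    return [abs(m - d) / abs(d) for m, d in zip(model, moments(data))]
```

For the exponential distribution the optimizer simply recovers rate = 1/mean, which makes the sketch easy to check. Each of the 22 PDFs would need its own likelihood and model moments, most would need a multi-parameter optimizer, and the best fit would be the candidate with the least relative error.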