Chapter-7: Case Study
www.AhsanAbdullah.com
1
(c) 2008 Dr. Ahsan Abdullah
Background
Data being recorded for decades by several
organization, mostly never digitized and never
used for decision making. Under-utilization.
Data is horizontally wide i.e. 100+ attributes and
vertically deep i.e. tens of thousands of rows.
Huge potential for long-term and short-term
decision making.
Decision making not data driven, but based on
“expert” judgment, sometimes with tragic results.
2
(c) 2008 Dr. Ahsan Abdullah
Area under study
Punjab is the bread-basket of Pakistan, and
administratively divided into eight divisions.
Multan division is the cotton hub of Punjab,
which consists of district Multan.
District Multan has three tehsils, which are
furhter divided into central points or Markaz.
This study is about 10 tehsils of Multan
using data for years 2001 and 2002.
3
(c) 2008 Dr. Ahsan Abdullah
Area under study: Map
Key Markaz
1 Bosan
2 Qadirpurran
3 Multan
4 Makhdum Rashid
5 Mumtazabad
6 Shujabad
7 Hafizwala
8 Jalalpur Pirwala
9 Qasba Marral
4
(c) 2008 Dr. Ahsan Abdullah
Major Players
A pest is an insect that eats the crop. There
are two types of cotton pests:
Sucking pests (White Fly, Jassid, Thrips etc.)
Boll Worms (spotted boll-worm, pink boll-worm)
A predator is an insect that eats the pest.
Such as Lady bug Beetle, Spiders, Ants etc.
Cotton is also effected by virus
Cotton Leaf Curl Virus (CLCV)
5
(c) 2008 Dr. Ahsan Abdullah
Economic Threshold Level (ETL_A)
The pest population beyond which it is cost
effective to use pesticide.
Pesticide is a poison which is used to kill
pests.
Note that eradicating pests is NOT feasible,
controlling pest population is feasible.
6
(c) 2008 Dr. Ahsan Abdullah
Pests & Predators
Pests Predators
Source: United States Department of Agriculture (USDA) 7
(c) 2008 Dr. Ahsan Abdullah
ETL_A: Graph
Economic Injury
ETL_A
Pest Population
Time
8
(c) 2008 Dr. Ahsan Abdullah
The need
Extensive losses to cotton crop due to pest attacks
in 1983 resulted in the establishment of Directorate
of Pest Warning in 1984.
Since 1984 scouts from the Directorate have been
sampling fields and recording data and advising
farmers.
During 2003-04 season, Boll Worm attack on the
cotton crop resulted in a loss of nearly 0.5 M bales.
Weather not the only factor, but points to a multitude
of factors, requiring efficient and effective data
analysis, for better decision making.
(c) 2008 Dr. Ahsan Abdullah
9
The need: IT in Agriculture
The volume of pest scouting data accumulated todate is
enormous both horizontally and vertically.
A typical pest scouting sheet consists of 35 variables or
attributes.
Metrological data consists of 50+ variables.
Coarse estimate of pest scouting data recorded for the cotton
crop alone stands at 5+ million records, and growing.
Tasking the human brain alone, for synthesis of information
from this data is not only impractical but dangerous too.
Need a Data Warehouse, OLAP tools and Data Mining to
analyze the data. 10
(c) 2008 Dr. Ahsan Abdullah
Agro-Informatics
“I.T. sector is at the heart of the economic
revival of Pakistan” President of Pakistan,
Launching of VU, Mar. 23, 2003.
Agriculture is the backbone of our economy,
upto 74% of the population is dependent on it.
IT is an enabler, and has the potential to benefit
everyone when applied in Agriculture.
IT + Agriculture: A win-win scenario.
11
(c) 2008 Dr. Ahsan Abdullah
Agro-Informatics: Relevant URL
To know more about Agro-Informatics, visit
www.agroict.org
12
(c) 2008 Dr. Ahsan Abdullah
How to go about?
Discussed several DWH implementation
methodologies in lectures 32-35.
Will adopt a pilot project approach, because:
A full-blown DWH requires extensive investment.
Show users the value of DSS.
Establish blue print for full-blown system.
Identify problem areas.
Reveal true data demographics.
Pilot projects are supposed to work with limited
data. 13
(c) 2008 Dr. Ahsan Abdullah
The 12-step Approach of Shaku Atre
14
(c) 2008 Dr. Ahsan Abdullah
Step-1: Determine User’s needs
A data warehouse or a data mart?
User not educated, think DWH is a big Excel sheet.
Some confuse DSS with Decision Making Systems.
DWH at Federal level.
Data marts at district level.
What are the business objectives?
Good governance via E-Governance.
Timely and data driven decision making.
Fore warning of farmers.
Timely planning for purchase/stocking of pesticides.
15
(c) 2008 Dr. Ahsan Abdullah
Step-1: Determine Users needs
Cost/benefit analysis, project estimation & risk
assessment
Cost of a prototype system fraction of the loss in one
year.
Required data either not made available or expensive.
Required data in very raw format.
People hungry for ideas looking for a solution on a plate.
Where do I get the data from?
Data available at Directorate of pest warning.
Raw form
Metrological department asking too much for data
collected using public funds..
Weather forecasts downloaded from web site of
16
www.dawn.com (c) 2008 Dr. Ahsan Abdullah
Steps-2&3: Determine DBMS Sever & Hardware Platform
Hardware and Software along with trained
people was already available on Campus.
The pilot DWH was implemented using NCR
Teradata Data Warehousing solution for NT
running on a dual Intel 950 Mhz Xeon
processors server and 1GB of RAM.
Total internal Hard Disk capacity of the server
amounted to 36 GB while external RAID
control supports 8 additional SCSIs of 18 GB
each. 17
(c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling
Dimensional Modeling
18
(c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling
Simplified ERD
Other field inputs such as irrigation, fertilizer etc. not included as data not
19
available
(c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling
KEY
WFN: White Fly Nymph
WFA: White Fly Adult
W: White
B: Brown
S: Small Larvae
L: Large Larvae
(optional)
Logical & Physical Design 20
(c) 2008 Dr. Ahsan Abdullah
Step-5: Construct Metadata Repository
What means what was important.
Who access what was not important as single
user environment.
Meetings held similar to requirements
definition, and questions transformed into
SQL queries.
Meta data issues still cropped up, more of
business rules.
21
(c) 2008 Dr. Ahsan Abdullah
Step-5: Surprise case
0.3
Sucking pests Ball Worm Complex
0.25
0.2
Correlation
SBW: Spotted Ball Worm
0.15 ABW: Army Ball Worm
PBW: Pink Ball Worm
0.1
0.05
0
PBW_RF
SBW
PBW_Bolls
ABW_White_Eggs
ABW_Brown_Eggs
WhiteFly_Nymph
WhiteFly_Adult
Thrip
Mite
ABW_Larvae_Small
Jassid
ABW_Larvae_Large
-0.05
-0.1
-0.15
-0.2
If pest population is low, predator population will also be low, because there will
22
be less “food” for(c)predators to live on i.e. pests.
2008 Dr. Ahsan Abdullah
Step-6: Data Acquisition & Cleansing
Hand filled pest scouting sheet
23
Typed pest scouting sheet
(c) 2008 Dr. Ahsan Abdullah
Step-6: Issues
The pest scouting sheets are larger than A4 size (8.5” x
11”), hence the right end was cropped when scanned on a
flat-bed A4 size scanner.
The right part of the scouting sheet is also the most
troublesome, because of pesticide names for a single
record typed on multiple lines i.e. for multiple farmers.
As a first step, OCR (Optical Character Reader) based
image to text transformation of the pest scouting sheets
was attempted. But it did not work even for relatively clean
sheets with very high scanning resolutions.
Subsequently DEO’s (Data Entry Operators) were
employed to digitize the scouting sheets by typing.
24
(c) 2008 Dr. Ahsan Abdullah
Step-6: Why the issues?
Major issues of data cleansing had arisen
due to data processing and handling at four
levels by different groups of people
1. Hand recordings by the scouts at the field
level.
2. Typing hand recordings into data sheets at the
DPWQCP office.
3. Photocopying of the typed sheets by DPWQCP
personnel.
4. Data entry or digitization by hired data entry
operators.
25
(c) 2008 Dr. Ahsan Abdullah
Step-7: Transform, Transport & Populate
26
(c) 2008 Dr. Ahsan Abdullah
Motivation For Transformation
Trivial queries give wrong results.
Static and dynamic attributes
Static attributes recorded repeatedly.
27
(c) 2008 Dr. Ahsan Abdullah
Step-7: Resolving the issue
Solution: Individualization of cultivated fields.
Technique similar to BSN used to fix names.
Unique ID assigned to farmers.
BSN used again, and unique ID assigned to fields.
Results:
Limitation: Field individualization not perfect. Some cases
of farmers with same geography, sowing date, same
variety and same area. Such cases were dropped.
28
(c) 2008 Dr. Ahsan Abdullah
Step-8: Middleware Connectivity
Since the source data is maintained in a
non digital format, hence connectivity with
the data warehouse was irrelevant.
Once digitized, it was rather straightforward
to load data into the warehouse.
Furthermore, in the foreseeable future, it
was not anticipated that the scouting
sheets were going to be maintained in a
digitized form. (c) 2008 Dr. Ahsan Abdullah
29
Step-9-11: Prototyping, Querying & Reporting
Implemented the prototype with user involvement.
Applications developed
10. A data mining tool was also developed based on an indigenous
technique that used crossing minimization paradigm for
unsupervised clustering.
11. A low-cost OLAP tool was indigenously developed; actually it
was a Multi dimensional OLAP or MOLAP.
Use querying & reporting tools
The following SQL query was used for validation:
SELECT Date_of_Visit, AVG(Predators),
…………………………AVG(Dose1+Dose2+Dose3+Dose4)
FROM Scouting_Data
WHERE Date_of_Visit 0
GROUP BY Date_of_Visit; 30
(c) 2008 Dr. Ahsan Abdullah
Step-12: Deployment & System Management
Since a pilot project, therefore, the traditional
deployment methodologies and system
management techniques were not followed to
the word, and are not discussed here.
31
(c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Data Validation
Quality and validity of the underlying data is the key to
meaningful and authentic analysis.
After ensuring a satisfactory level of data quality (based on
cost-benefit trade-off) extremely important to scientifically
validate the data that the DWH will constitute.
Some very natural checks were employed for this purpose.
Relationship between the pesticide spraying and predator
(beneficial insects) population is a fact well understood by
agriculturists.
Predator population decreases as pesticide spray increases
and then continually decreases till the end of season.
32
(c) 2008 Dr. Ahsan Abdullah
0
2
4
6
8
10
07/06/01
07/09/01
07/12/01
07/15/01
Predator
07/18/01
07/21/01
07/24/01
Spray
07/27/01
07/30/01
08/02/01
08/05/01
08/08/01
08/11/01
(c) 2008 Dr. Ahsan Abdullah
08/14/01
08/17/01
08/20/01
08/23/01
08/26/01
08/29/01
09/01/01
09/04/01
09/07/01
33
Agri-DSS usage: Data Validation Graph
Agri-DSS usage: Spray Dates
Working Behaviors at Field Level: Spray dates
As expected, apparently spray dates and spray
frequency for 2001 and 2002 do not displayed any
correlation; as it is dependent on pest populations,
availability of pesticides etc.
For deep analysis, moving average of sprays for five
days, and a moving correlation of sprays for five days
were calculated and plotted.
For the sake of uniformity, the moving average of
spray was normalized using the maximum spray
frequency. (c) 2008 Dr. Ahsan Abdullah
34
Agri-DSS usage: Spray Dates Graph
0.90 Moving Avg
Correlation
Relative values
0.70
0.50
0.30
0.10
8_2
8_7
9_1
9_6
7_23
7_28
8_12
8_17
8_22
8_27
9_11
9_16
9_21
9_26
10_1
-0.10
-0.30
Spray dates (mm_dd) for 2001 & 2002
-0.50
No relationship should have existed for the two years. But note the
surprising finding that most sprays occurring on and around 12th Aug. in
BOTH years with high correlation, appearing as a spike.
Also note the dip in sprays around 11th Sep.! Sowing at predetermined time
makes sense, as it is under the control of the farmer, but that is not true for
spraying. Pests don’t follow calendars; therefore, whenever, ETL_A is
crossed pesticides are sprayed. 35
(c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Explaining Findings
14th Aug. is the independence day of Pakistan and a national holiday.
In Pakistan, people are in a habit of sandwiching gazetted holidays with
casual leaves; consequently businesses are closed for a longer period,
including that of pesticide suppliers.
14th Aug. occurred on Tue and Wed in 2001 and 2002, respectively, thus
making it ideal to stretch the weekend.
During Aug/Sep. humidity is also high, with correspondingly high chances
of pest infestations.
Therefore, apparently the farmers decided not to take any chances, and
started spraying around 11th Aug.; evidently even when it was not required.
The weather forecast for 13 Aug. 2001 and 2002 was showers and cloudy,
respectively. Therefore, most likely the pesticide sprayed was washed-off.
Decline in sprays around 9/11 could not be explained.
36
(c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Sowing Dates
The results of querying the sowing date based on the day of the week are
shown below:
2001: Sowing week_day 2002: Sow ing w eek_day
500 431 429 500
405 398 409 387
400 367 357
400
303 278
300 300 223
179 174
200 200 124
100 100
0 0
Fri
Fri
Tue
Thu
Mon
Tue
Thu
Mon
Wed
Sun
Sat
Wed
Sun
Sat
Observe least number of sowings done on Thursdays, in each year. This finding
was later confirmed by extension personnel.
Multan is famous for its shrines. Thursdays are usually related with religious
festivals and activities, a mix of devotion and recreation, and usually held at
shrines, hence a tendency of doing less work on Thursdays. Similar behavior was
observed for spraying too.
37
(c) 2008 Dr. Ahsan Abdullah
Conclusions & Lessons
ETL is a big issue. There are no digitized operational databases, data entry of
these sheets is very expensive, slow and prone to errors.
Each farmer is repeatedly visited by agriculture extension people. This results in
repetition of information, about land, sowing date, variety etc. Hence, farmer and
land individualization are critical. Such an individualization task is hard to
implement for multiple reasons.
There is a skewness in the scouting data. Public extension personnel (scouts)
are more likely to visit educated or progressive farmers, as it makes their job of
data collection easy. Furthermore, large land owners and influential farmers are
also more frequently visited by the scouts. Thus the data does not give a true
statistical picture of the farmer demographics.
All pests are not present all the time (such as ball worms), most of the time
second spray is not done (or not recorded), and hence tables are sparse.
Therefore, it is better to split tables so as to decrease header size and table
space.
Unlike traditional data warehouse where the end users are decision makers,
here the decision-making goes all the way “down” to the extension level. This
presents a challenge to the analytical operations’ designer, as the findings must
be fairly simple to understand and communicate.
38
(c) 2008 Dr. Ahsan Abdullah