Anthropological Informatics


       Reality Measures
       or Reality bytes
   Measurement and Perception
“Take away number in all things and all things
  perish. Take calculation from the world and all is
  enveloped in dark ignorance, nor can he who does
  not know the way to reckon be distinguished from
  the rest of the animals.” St. Isidore of Seville
“And still they come, new from those nations to
  which the study of that which can be weighed
  and measured is a consuming love.” W.H. Auden
“In causal terms the presence of oxygen is a
  necessary but not a sufficient condition for
  fire. Oxygen plus combustibles plus the
  striking of a match would illustrate a
  sufficient condition for fire” William L. Reese
A Necessary and Sufficient Condition

 • Oxygen
 • Combustibles
 • Matches
     Visualization: The Match?
“Science and technology have advanced in more
  than direct ratio to the ability of men to contrive
  methods by which phenomena which otherwise
  could be known only through the senses of touch,
  hearing, taste, and smell have been brought within
  the range of visual recognition and measurement
  and thus become subjects to that logical
  symbolization without which rational thought and
  analysis are impossible.” William N. Ivins
“One of the fundamental traits of the mind of
  the declining middle ages is the
  predominance of the sense of sight, a
  predominance which is closely connected
  with the atrophy of thought. Thought takes
  the form of visual images. Really to impress
  the mind a concept has first to take the
  visible shape.” Johan Huizinga
• Modern: we feel that quantities are set and
  transactions are fair and equivalent
• Present : Past : with inspection, vagaries and
• In Roger Bacon's 13th century, quanta differed from
  region to region and transaction to transaction
• A bushel of oats was neither more nor less than as
  many oats as a bushel basket contained, but a bushel
  for the lord would be heaped and a bushel for the
  peasant was no more than level with the rim (the
  differential was not cheating but a proper
Greek metrological relief
Conical sundial with hours in Greek letters
Greek multiplication wax tablet
Roman measuring tools
Egyptian measuring gold rings against a bull’s head weight
Egyptian alabaster vase with volume marked as 8 1/2 hennu
Roman milestone
Facsimile of the Peutinger Table, a copy of a Roman road map; Rome is at the center
Ptolemy’s “Geography”
            Changes in Vision

• A shift to the visual in the Middle Ages was the
  match that ignited the flame of quantification
• Change was marked in several main fields of
  human exertion:
• There was a shift in conduits of authority from the
  ear to the eye
• In the 14th century, a new cursive script was devised, with
  word separation and punctuation for easier writing
  and reading
• Reading became swift and silent
• Literacy spread to classes beneath poets and
  philosophers: composers, painters and
• Renaissance Europeans considered music to be an
  emanation of the basic structure of reality
  (harmony guided the heavens)
• Gregorian chants were performed from memory
• By c. 10th century, accumulation of chants
  exceeded apprentices’ abilities to memorize
• Monks developed a system of “neumes” or signs
  to indicate highs and lows without a musical staff
• The musical staff was standardized by Guido of
  Arezzo, an 11th-century Benedictine choirmaster
• Ut … re … mi … fa … sol … la … cut the
  training of a good singer from 10 years to 1 year
• 4 of the liberal arts considered essential for a solid
• Arithmetic
• Geometry
• Astronomy
• Music
Music and science: Galileo, Descartes, Kepler and
  Huygens were all accomplished musicians and
  published on measurement in musical subjects

• Medieval artists were more concerned with
  rank of their subjects than with the faces of
  individuals (size = importance; space was to
  be filled by altering perspectives)
• In the 14th century, geometry begins to
  guide compositions (scenes were to be
  viewed by an observer at single point in
  time; perspective was adhered to)
“We shall ever give ground to honor. It will stand to
  us like a public accountant, just, practical, and
  prudent in measuring, weighing, considering,
  evaluating, and assessing, everything we do,
  achieve, think and desire.” Leon Battista Alberti (1440)
“Inasmuch as all things in the world have been made
  with a certain order, in like manner they must be
  managed … of the greatest importance, such as
  the business of merchants, which … is ordered for
  the preservation of the human race.” Benedetto de
  Cotrugli (15th c.)
The merchant struggling to make
 sense of his books was a theme
• Blizzards of transactions, scrambled by
• Bills of exchange
• Promissory notes
• Credit practices
• Axiom: production preceded delivery
• Reality: payments could precede delivery or
• Payments were undulatory, with currencies and
  bills of exchange billowing and plunging in value
  in relation to one another
  RECORDS … My god, we need
    records, or what will we know?

• By the end of the 14th c. Hindu-Arabic numerals
  were beginning to appear in merchants’ account
• Double-entry accounting systems were developed
  (ingoing and outgoing values; plus and minus);
  great improvement over narrative accounts
• By the 15th century, an accounting lexicon and
  guides to practice were being developed
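
The double-entry idea in the bullets above (every value recorded once as ingoing and once as outgoing) can be illustrated with a minimal sketch. The account names, the amounts, and the function names `post` and `is_balanced` are hypothetical, chosen only for illustration; this is not a reconstruction of any historical ledger format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One journal entry: (account debited, account credited, amount).
Entry = Tuple[str, str, int]


def post(entries: Iterable[Entry]) -> Dict[str, int]:
    """Post each entry twice: plus to the debit account, minus to the credit account."""
    balances: Dict[str, int] = defaultdict(int)
    for debit, credit, amount in entries:
        if amount <= 0:
            raise ValueError("amounts must be positive")
        balances[debit] += amount
        balances[credit] -= amount
    return dict(balances)


def is_balanced(balances: Dict[str, int]) -> bool:
    """In double entry, all account balances must sum to zero."""
    return sum(balances.values()) == 0


# Hypothetical journal: goods bought for cash, then sold on credit.
JOURNAL = [
    ("Goods", "Cash", 100),
    ("Receivable", "Goods", 100),
]

if __name__ == "__main__":
    ledger = post(JOURNAL)
    print(ledger, is_balanced(ledger))
```

Because each amount enters the books twice with opposite signs, a nonzero total signals a recording error, which a narrative account cannot reveal as directly.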
           Visions and Models

“I often say that when you can measure what
  you are speaking about and express it in
  numbers you know something about it; but
  when you cannot measure it, and when you
  cannot express it in numbers, your
  knowledge is of a meagre and
  unsatisfactory kind.” William Thomson, Lord Kelvin
        Our Information Age:

• All information is incomplete. There is always
  more to know, always another way to
  reframe what is already known. Our leaders
  must make important decisions on the basis
  of incomplete information
• Information does not narrow the range of
  choices; it widens it. Further information is
  likely to make any decision-making process
  more meaningful and effective. It may not
  make the decision easier.
• Information is always subject to multiple
  interpretations and constructions. “Data” is
  nothing until it is given meaning and
  assembled in a narrative.
• Information comes in many forms: data, stories,
  myths, visual images, and meta-theories.
  Information theorists do not regard data as
  information at all. It is potential information.
  Information is data endowed with relevance and
• Data are undigested facts.
• Information is facts organized for you by
  someone else but not yet absorbed into your own
• Knowledge is information that you have
• Different people speak different information
  languages even when they are speaking the
  same language.
• Information leaks. In our information
  society nobody keeps secrets. There is an
  erosion of confidentiality that accompanies
  the inundation in information through
• Information once distributed is almost
  impossible to destroy. Information has its
  own survival skills.
          Information Production
                           • about 10 exabytes
                           • 90% digital
                           • 55% personal
                           • print .003% of bytes
                           • email is 4 PB/y
                           • www is about 50 TB
                           • growth at 50%/y

Gray and Szalay 2003
             The First Disk 1956

•   IBM 305 RAMAC
•   4 MB
•   50 X 24” disks
•   1200 rpm
•   100 ms access
•   $35K/y rent
•   Included computer and
    accounting software
10 Years Later

                 30 MB
Cost of Storage
      Storage Capacity Outstrips
            Moore’s Law
• Improvements
   Capacity 60%/y
   Bandwidth 40%/y
   Access time 16%/y
• $1000/TB today
• $100/TB in 2007
Moore’s Law: 58.7%/y
TB growth: 112.3%/y
Price decline: 50.7%/y
             Moore’s Law

• Performance/price doubles every 18 months
• 100 X per decade
• Progress in next 18 months will outstrip all
  previous progress (new storage sums all
  previous storage and new processing will
  outstrip all old processing)
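
The arithmetic behind these bullets can be checked with a short sketch. It assumes an idealized, exact doubling every 18 months; the function names are illustrative and do not come from the slides.

```python
from typing import Tuple

DOUBLING_MONTHS = 18  # assumption: one clean doubling per 18 months


def growth_factor(months: float) -> float:
    """Multiplicative growth after `months` under the doubling assumption."""
    return 2 ** (months / DOUBLING_MONTHS)


def annual_growth_percent() -> float:
    """Equivalent compound growth per year, in percent."""
    return (growth_factor(12) - 1) * 100


def earlier_vs_newest(n: int) -> Tuple[int, int]:
    """Return (sum of all earlier periods, size of the newest period).

    Period k contributes 2**k units, so the earlier periods sum to
    2**n - 1, always one less than the newest period's 2**n.
    """
    earlier = sum(2 ** k for k in range(n))
    newest = 2 ** n
    return earlier, newest


if __name__ == "__main__":
    print(f"growth per decade: {growth_factor(120):.1f}x")
    print(f"growth per year:   {annual_growth_percent():.1f}%")
    print("earlier vs newest: ", earlier_vs_newest(10))
```

Under this assumption a decade gives roughly a hundredfold gain, and the yearly rate comes out near the 58.7%/y quoted for Moore's Law on the storage slide. The geometric sum is why each new period exceeds everything before it combined.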
      Rules of Thumb for Data
• Moore’s Law: an address bit per 18 months
• Storage grows 100 X/decade (1000X in last
• Disk data of 10 years ago now fits in RAM
• Device bandwidth grows 10X/decade (need for
• RAM:disk:tape price is 1:10:30 and will go to
• Gilder’s Law: aggregate bandwidth 2X/8 months
• Web Rule: cache everything
    Filling A Terabyte In A Year

 Item                          Items/TB   Items/day
 300 KB JPEG                   3M         9,800
 1 MB Doc                      1M         2,900
 1 hour 256 kb/s MP3 audio     9K         26
 1 hour 1.5 Mb/s MPEG video    290        0.8

Gray and Szalay 2003
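
The table entries follow from simple division, as the sketch below illustrates. It assumes binary units (1 TB = 2**40 bytes, 1 KB = 2**10 bytes) and a 365-day year, neither of which is stated on the slide, and it covers only the three rows that this simple model reproduces to the slide's rounding. All names are illustrative.

```python
TB = 2 ** 40          # assumption: binary terabyte
DAYS_PER_YEAR = 365   # assumption: non-leap year


def stream_bytes(bits_per_second: float, seconds: float) -> float:
    """Size in bytes of a constant-bit-rate stream."""
    return bits_per_second * seconds / 8


def items_per_tb(item_bytes: float) -> float:
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes


def items_per_day(item_bytes: float) -> float:
    """Items to store per day in order to fill one terabyte in a year."""
    return items_per_tb(item_bytes) / DAYS_PER_YEAR


ITEM_SIZES = {
    "300 KB JPEG": 300 * 2 ** 10,
    "1 MB Doc": 2 ** 20,
    "1 hour 256 kb/s MP3 audio": stream_bytes(256_000, 3600),
}

if __name__ == "__main__":
    for name, size in ITEM_SIZES.items():
        print(f"{name:28s} {items_per_tb(size):12,.0f} {items_per_day(size):10,.1f}")
```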
          Schematized Storage
• File metaphor too primitive: just a “blob”
• Table metaphor too primitive: just “records”
• Need metadata describing data context
   – Format
   – Provenance (author, publisher, citations)
   – Rights
   – History
   – Related documents
                             • in a standard format
                             • XML and XML schema
                             • Data Set is a great example
                             • World is defining standard schema
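
A minimal sketch of what such a metadata record might look like in XML follows. The element names (`metadata`, `provenance`, `event`, and so on) and the sample values are hypothetical; they mirror the bullets above and are not taken from any standard schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_metadata(fmt: str, author: str, publisher: str, citations: List[str],
                   rights: str, history: List[str], related: List[str]) -> ET.Element:
    """Assemble one metadata record describing a data set's context."""
    root = ET.Element("metadata")
    ET.SubElement(root, "format").text = fmt

    provenance = ET.SubElement(root, "provenance")
    ET.SubElement(provenance, "author").text = author
    ET.SubElement(provenance, "publisher").text = publisher
    for citation in citations:
        ET.SubElement(provenance, "citation").text = citation

    ET.SubElement(root, "rights").text = rights

    hist = ET.SubElement(root, "history")
    for event in history:
        ET.SubElement(hist, "event").text = event

    rel = ET.SubElement(root, "related")
    for doc in related:
        ET.SubElement(rel, "document").text = doc
    return root


# Hypothetical sample values, for illustration only.
RECORD = build_metadata(
    fmt="text/csv",
    author="A. Example",
    publisher="Example Archive",
    citations=["Example 2003"],
    rights="public",
    history=["created", "cleaned"],
    related=["field-notes.txt"],
)
XML_TEXT = ET.tostring(RECORD, encoding="unicode")

if __name__ == "__main__":
    print(XML_TEXT)
```

Unlike a bare file or table, the record carries its own context, so a later user can tell who produced the data, under what rights, and what has been done to it.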
           Keys for Storage

• Schematized storage can help organization
  and research
• Schematized XML data sets are a universal
  way to exchange data
• Data are objects, and so, need standard
  representation for classes and methods
Access Variable and Increasing
             Stages in Science
• Observational Science
  Scientist gathers data by direct observation
  Scientist analyzes data
• Analytical Science
  Scientist builds analytical model
  Makes predictions
• Computational Science
  Simulate analytical model
  Validate model and make predictions
• Data Exploration Science
  Data captured by instruments or generated by simulator
  Processed by software
  Placed in a database as files
  Scientist analyzes database files
            Data Avalanche

• Better observational instruments and better
  simulations are producing an avalanche of
        Discoveries Booming

• Conceptual and theoretical discoveries (relativity,
  quantum mechanics) may be inspired
  by observations
• Phenomenological discoveries (dark matter,
  obscured universe) are made through advances in
  empirical rigor; they inspire theories and are
  motivated by them
            Discovery Cycle
•   New technical capabilities
•   Observational discoveries
•   Advances in theory
•   Application of new theories

Phenomenological discoveries: exploring parameter
  space; making new connections
Maxim: understanding complex phenomena requires
  complex, information rich data and simulations
            How to Keep Up
• We are looking for “needles in haystacks” (the
  Higgs particle in dark matter)
• Needles are easier than haystacks
• Global statistics have poor scaling
• As data and computers grow at the same rate, we
  can only keep up with N log N
• Discard notion of optimal: data are fuzzy and
  solutions are approximations
• Require combination of statistics and computer
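
The scaling argument above can be made concrete with a small sketch. It assumes data volume and computer speed both double every year (the doubling rate, the starting size, and the function names are illustrative assumptions) and asks how the running time of an algorithm changes relative to today.

```python
import math
from typing import Callable, Dict


def relative_runtime(cost: Callable[[float], float], n0: float,
                     growth: float, years: int) -> float:
    """Runtime after `years` relative to today, when both the data size
    and the computer speed grow by the factor `growth` each year."""
    scale = growth ** years
    work_ratio = cost(n0 * scale) / cost(n0)  # how much more work there is
    return work_ratio / scale                 # divided by how much faster we compute


COSTS: Dict[str, Callable[[float], float]] = {
    "N": lambda n: n,
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
}

if __name__ == "__main__":
    for name, cost in COSTS.items():
        ratio = relative_runtime(cost, n0=1e6, growth=2.0, years=10)
        print(f"{name:8s} runtime grows {ratio:,.2f}x over 10 years")
```

Under these assumptions a linear algorithm keeps pace exactly, an N log N algorithm slows only by the ratio of the logarithms, and a quadratic algorithm falls behind by the full growth factor.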
Analysis of Databases
        •   Create uniform samples
        •   Filter data
        •   Assemble subsets
        •   Estimate completeness
        •   Censor bad data
        •   Count and build histograms
        •   Generate Monte Carlo subsets
        •   Perform likelihood calculations
        •   Test hypotheses

        These tasks are best done inside
          databases (“bring Mohamed to the mountain”)
           Go for Smart Data
• Too much data to move around, so take analysis to
  the data
• Do all data manipulations inside the database
  (build custom procedures and functions in the
• Guaranteed automatic parallelism
• Easy to build custom functionality key (pixel
  processing, temporal and spatial indexing, unified
  databases and procedures)
• Easy to reorganize data (multiple views make
  optimal analyses)
• Scalable to Petabyte data sets
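
As a minimal sketch of taking the analysis to the data, the example below censors bad rows and builds a histogram entirely inside an SQLite database, so only the small summary leaves it. The table `objects`, its columns, and the random sample catalog are hypothetical stand-ins; a production system would use a server database rather than an in-memory one.

```python
import random
import sqlite3
from typing import List, Tuple


def build_catalog(n_rows: int = 10_000, seed: int = 0) -> sqlite3.Connection:
    """Create an in-memory catalog with a measurement and a quality flag."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects (id INTEGER PRIMARY KEY, magnitude REAL, quality INTEGER)"
    )
    rows = [(rng.uniform(10.0, 20.0), rng.choice([0, 1])) for _ in range(n_rows)]
    conn.executemany("INSERT INTO objects (magnitude, quality) VALUES (?, ?)", rows)
    return conn


def histogram_in_db(conn: sqlite3.Connection, bin_width: float = 1.0) -> List[Tuple[float, int]]:
    """Filter out flagged rows and count per bin inside the database."""
    query = """
        SELECT CAST(magnitude / ? AS INTEGER) * ? AS bin_start,
               COUNT(*) AS n
        FROM objects
        WHERE quality = 1          -- censor bad data
        GROUP BY bin_start
        ORDER BY bin_start
    """
    return conn.execute(query, (bin_width, bin_width)).fetchall()


if __name__ == "__main__":
    catalog = build_catalog()
    for bin_start, count in histogram_in_db(catalog):
        print(f"{bin_start:5.1f}  {count}")
```

The query returns about ten rows no matter how many objects the table holds, which is the point of keeping the manipulation next to the data.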
             Data Mining Images

We can discover new types of phenomena using automated pattern
recognition; multiscale analyses
            Optimal Statistics
• Statistical algorithms scale poorly
• Even if data and computers grow at same rate,
  computers can do at most N log N algorithms
• Solutions:
  assume infinite computational resources
  assume only source of error is statistical
  there is a finite sample size
Solutions will require combinations of statistics and
New algorithms will not be worse than N log N
   Make Clever Data Structures

• Use of tree structures
• Fast, approximate algorithms
• Must account for computation costs
  scale level of accuracy
  shoot for “best” results given …
• Explore parameter spaces in catalog
  domains through
   – Clustering analysis (different types and
   – Multivariate correlations (find significant,
     nontrivial correlations in the data)
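
One common tree structure for exploring a parameter space is the k-d tree, sketched below as an illustration; the slides do not name a specific structure, so this choice is an assumption. The search shown is exact; an approximate variant would prune branches more aggressively and trade accuracy for speed.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class Node:
    """One k-d tree node: a point, its splitting axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["Node"], right: Optional["Node"]) -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Split on the median, cycling through the axes level by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], target: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Exact nearest neighbour; skips subtrees that cannot hold a closer point."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    catalog: List[Point] = [(rng.random(), rng.random()) for _ in range(1000)]
    tree = build(catalog)
    print(nearest(tree, (0.5, 0.5)))
```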

Visualization becomes the key; include interactive
  visualization and data mining processes
              Publishing Data

• expectations and standards must change
• there will be exponential growth
• projects must become more responsible
Archaeological Informatics

Organizing Piles of Articulated and
    Disarticulated Information
      “Great Chain of Being”
• Stewart (1997) summarized the course of
  archaeological information moving to
  information as the GCB: moving from
  logical stages in data collection, to data
  management, to data analysis, and to
  variable modes of dissemination
• Use of Information Technology (IT) was to
  be seen as a multistranded web rather than
  as a linear feature on the computing
            Archaeological IT
• Quantitative methods
• Statistics and
  classification            All require
• Archaeometry              • Digital archives
• Visualization             • Databases
  (imaging, CAD,
  multimedia and virtual
• Expert systems
• Artificial intelligence

• Term supplanted “databanks” in the 1980s
• Concept linked to increased availability of
• Emphasis accompanies shift to industry standard
• Enhancement is a profound goal of government
  organizations as they move toward encompassing
  strategies for digital data management
            Access to Data

• Has emerged as the primary hot button of
  the 21st century
• Digital archives are being built but data
  languishes, unsorted and unavailable
• The backlog of information is huge and
• Technological fixes are available but
  implementation is a social problem
              Techno Science

• Use of electronic media to enhance scientific
  communication is a huge shift in the conduct of
  basic science
• Scientists want pure access to information
• Potential for cross-disciplinary and international
  collaborations is booming
• Keys are building adequate metadata, migrating
  data, and controlling access to information
            There are Risks

• We cannot allow transformation of
  scientific communication to occur in a pure
  laissez-faire environment
• We cannot assume that everyone will catch
  on to using e-media structures
• We cannot assume that various e-media
  initiatives represent a period of problem-
           What’s Out There?
• Run-away agendas and competing
  proprietary interests that will seek to retard
  powerful e-venues
• Huge amounts of money and resources are
  being committed by government agencies,
  private firms and organizations, by
  academics, by publishers, by professional
  societies, and individual researchers for
  development, maintenance and promotion
  of all sorts of competing e-media and for
  proprietary e-markets
         Practical Problems
• Scientists and policy-makers do not have
  accepted theory for shaping IT
• Producers and users work within context-
  free models
• Work consists of ongoing prototyping and
  fledgling projects with high promise and
  withered funding
• The result: wasted funding, and orphaned
  data left in marginal, decaying, dead
  systems and formats
      Responses: E-com reform
• Extends across all e-media
• Spokesmen include Paul Ginsparg and Paul
• Harnad urges decentralized scholarly publishing,
  peer-reviewed or not (editor of Psycoloquy);
  originator of “scholarly skywriting”
• Ginsparg is developer of the Los Alamos National
  Labs Physics E-Print Server, working papers for
  high-energy physicists
• Future: move away from hard-copy journals and
  archives in all forms, centralized and decentralized
            Reform Ideology

•   E-media is better than traditional media
•   E-communication will be less expensive
•   Access to e-media will be easier and wider
•   Systematic use of e-media will dramatically
    speed up scientific communication
           Subversive Actions
• Editors of Electronic Transactions on Artificial
  Intelligence (ETAI) have created a completely
  open article review process
• Phase I: article is open to the public online for 3
• Phase II: after author response, the article
  is reviewed for acceptance using confidential peer
  review and journal level quality criteria
• The Journal of Artificial Intelligence Research (JAIR) uses
  online appendices and discussions of published
• JAIR is distributed without charge on the Internet
           Social Designing

• Electronic access to resources that include
  primary data
• High speed of work- and results-sharing
• Selection of target audiences for research
• Allocation of proper credit for work
• Allocation of professional status based on
  quality of data design and data sharing
                Market Forces

• Industrial and corporate support for research
  creates authoritative, owner-driven sanctions on
  information dissemination
• These distribution systems are opaque, hidden
  behind secure doors
• Data release is carefully controlled, if allowed,
  and timing is completely geared to corporate
  advantage and profit-making
• Two poles: open access (transparent) and
  controlled access (opaque)
      “Boom and Bust Cycles”
• “Worm Community System” for molecular
  biologists proved too complicated and costly for
  most users
• WCS was recast as A C. elegans DataBase
  (ACEDB), which has found greater acceptance
• Many biologists invested in the “Genome
  Database” only to see financial support withdrawn
• The “Archaeological Data Archive Project,” much
  celebrated, is now dead for lack of clientele
 Liberating Archaeological Data

• Perring and Vince (1999) set out a guide for
  bringing complex archaeological data out to view
• They cite Hodder (1998) on the impact of the
  Internet in organization of archaeological
  knowledge, with a shift from hierarchical
  structures to network flows
• The veil: many archaeologists, working under
  Federal and State mandates, remain outside any
  long term concern with data handling
• Data liberation runs afoul of insistence on
  fossilized traditional research practice, fueled by
  resource management contracts
          Need for Re-thinking
• Archaeological classification practices will need to
  emphasize optimal structures for organization of
  archaeological data in an electronic environment
• Interpretive structures must admit variable ways of
  grouping data
• Higher order groupings (typologies) will have to
  be supplemented by alternative analytical
  groupings (material classes, deposition classes)
• Data structures will have to be flexible and
New Structures Must Recover Links
• Traditional databases (TDs) have disparate or
  unlinked compendiums (fields with specimen
  measurements but no link to “grey literature”)
• TDs typically are arranged to follow a rigid linear
  structure based on chronological groupings
  dictated by field recovery records and publishing
• This produces intractable data sets, where
  important data remain unavailable because
  reclamation costs are so high, there is a lack of
  integration for specialist data to be linked with
  overall data structure, and little potential for future
             New Methods

• Proviso: we cannot enter new data as old
  structures into new IT (HTML,
  interrelational databases, and GIS) and
  expect working databases
• The theory-driven structure of the data must
  be revised
      SAA 2000 position paper

• “Digital Data: Preservation and Re-Use” promoted
  ideas on improvements
• Robinson’s “Digital Archiving Pilot Project for
  Excavation Records” (DAPPER) reviewed
  projects’ data handling
• A central concern was the user interface, whether
  it should be designed for aesthetics or for clean
  access to data
• Argued for data preservation in standard formats
  as proposed by the UK Archaeology Data Service
              Cost Measures
• Digital archiving of Eynsham Abbey collections
  cost 1.2% of excavation and post-excavation
• Digital archiving of the Royal Opera House
  collections cost .1% of the total project cost
• Upshot: CAD archives, arranged as separate files,
  are more cost efficient for non-specialist venues,
  while GIS is the more powerful research tool but
  requires specialist training
     Levels of Digital Archives
• Index level archive: index record for ADS catalog
  and summary document; no further work
• Assessment level archive: index record, project
  design, assessment report, specialist level
  databases, and site matrix
• Research level archive: above, with analytical
  results and publications
• Integrated archive: above, with records of ongoing
  scholarship, linking text files with other data
• Must ensure reuse of data: Eiteljorg
  emphasizes need for user training in CAD,
  GIS and database software
• Data translations are tricky: any
  relationships within software must be
  identified (data segments in CAD layers or
  DBF relations and links)
• Assessments of:
  – Systematic collection methodology
  – Record of data corrections
• Data about data, providing information essential to
  data use and reuse
• Can refer to agreed upon sets of fields and
  associated lexicons
• Can consist of detailed descriptions of
  measurement systems and rules for their
• Data users need metadata to make intelligent
  decisions in selecting, using, adding to, or
  translating databases
Increasing Number of Standards
• MARC, Machine-Readable Cataloging, library
• Text Encoding Initiative (TEI), standard
  descriptions of machine readable text
• Directory Interchange Format (DIF),
  metadata for satellite imagery
• U.S. National Spatial Data Infrastructure
  (NSDI), complex descriptions of spatial
                Dublin Core
• Seeks to supply metadata descriptions between
  crude metadata of search engines and complex
  systems developed for MARC and the Federal
  Geographic Data Committee
• Can describe resources on the Internet and to
  insert file types (HTML and various postscript
• DC is extended as separate frameworks as in the
  Warwick Framework (descriptions can be stored
  as DIF or FGDC, or as simple extensions of the 13
  DC elements)
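
A minimal sketch of a Dublin Core style description follows. The element names used (`title`, `creator`, `date`, `format`, `type`, `identifier`) belong to the Dublin Core element set, and the namespace URI is that of the later Dublin Core Metadata Element Set 1.1. The `record` wrapper element and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the "dc:" prefix


def dublin_core_record(fields: Dict[str, str]) -> ET.Element:
    """Wrap simple name/value pairs as Dublin Core elements."""
    root = ET.Element("record")  # hypothetical wrapper, not part of Dublin Core
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Hypothetical sample description of an excavation archive.
EXAMPLE = dublin_core_record({
    "title": "Excavation archive (hypothetical example)",
    "creator": "A. Example",
    "date": "2000",
    "format": "text/html",
    "type": "Dataset",
    "identifier": "example-0001",
})
XML_TEXT = ET.tostring(EXAMPLE, encoding="unicode")

if __name__ == "__main__":
    print(XML_TEXT)
```

A flat set of optional elements like this is what places Dublin Core between bare search-engine keywords and the much heavier cataloging and geospatial standards.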
       Metadata and Databases

• Metadata should act to improve or restrict
  access to data
• Facilitate sharing and interoperability
• Characterize and index data
                Data Models
• Data are a model of the real world
• The description is arbitrary and biased
• Data models incorporate different data views
• Key issues: verification, validation and
  certification of data quality
• Measures: objective correctness (accuracy and
  consistency) and appropriateness defined by
  intended purpose
• Required: all data must be augmented with
  metadata to record information needed to assess
  data quality, record results of assessments, and
  support process control
     Measures for Data Quality
• Adequate description and meaning
• Specification of intended use and range of
  purposes and constraints
• Requirements for access and use
• Description and rationale for structure and
• Global relationships to other databases
• Update cycle information
           Data Deterioration
• Limited media life
• Rapid obsolescence of software and hardware
• Use of graphics, hypertext and linked structures
  only accelerates decay rates
• Data files will become increasingly dependent on
  specific software for continued interpretation
• Record keeping paradigms are essential
  (compression is not an option; annotated metadata
  must remain transparent)

• Archaeological data and information are
  growing exponentially
• New data paradigms must be created
• Effects on theory and method will be
• Effects on the culture of the discipline will
  prompt profound dislocations
