Introduction to digital libraries: the ssh protocol

					         LIS654 lecture 1

Introduction to the course, the ssh protocol

           Thomas Krichel
               what's up doc?
• This lecture is to introduce the topic of digital
  libraries.
• First part: we look at the nature of digital
  libraries. This part is informed by the first
  chapter of Arms' book.
• Second part: look at repository planning. This
  part is misinformed by the Reese and
  Banerjee book.
• Then we talk ssh. A sorry excuse to play with
              first part contents
•   digital libraries
•   digital librarianship
•   a course on digital libraries
•   with the aim of training digital librarians
              digital libraries
• Generally, we can think about digital libraries
  – information stored on a computer
  – delivered via a network
  – mimics existing libraries
• As Arms puts it “a managed collection of
  information, with associated services, where
  the information is stored in digital formats and
  accessible over a network”.
• We are at the start of digital libraries.
• The problem is that the technology is still
  expensive; the cost is still coming down.
• The opportunity is that we can build
  pioneering systems now, that will have a
  lasting social impact.
• ISI journal citation report is based on two
  years of data of citations to journals.
• When Eugene Garfield founded it, he
  published the report in the second year of
  getting data.
• For the next issue, he chose the same horizon
  of data.
• Citation rankings of journals still use 2 years,
  almost 50 years later.
           benefits: availability
• Digital libraries bring the information closer to
  the user than physical libraries can
   – physically
   – temporally
• Even when you are in the physical library you
  still get faster access to digital library items.
           benefits: findability
• Information can be more easily found in digital
  form than in print.
• Some non-textual information is still only
  findable via metadata.
• But computer scientists are working on that.
            benefits: sharing
• Information can be shared.
• Items can not be damaged.
• Items can not be stolen.
           benefits: updating
• Information can be kept up-to-date more
  easily.
• To update a book, you have to reprint all
  copies, and replace them.
         benefits: new media
• Information can be created and manipulated
  in completely new ways.
• For example location information can be
  mixed up with subject information.
                issue: costs
• The cost of storing print information is very
  high. It is a multiple of acquisition costs.
• Digital storage devices decline in price.
• But digital information manipulation requires
  skills that are not easy to procure.
• The overall cost comparison is difficult to
  make.
        drawback: preservation
• Preserving information is easy on paper.
• Preserving digital information looks very hard.
• We will not look at this issue in the course,
  because there is a specialized Palmer School
  course dealing with this.
   drawbacks: monopoly dangers
• Since the information only needs to be kept in
  one copy, and others can access it, there are
  inherent dangers of the build-up of
  monopolies.
• One example is the Google search engine.
     drawbacks: free information
• Since the information is easier to copy, it is
  harder to police illegal sharing.
• Some creators and intermediaries are feeling
  the pinch.
• The newspaper industry is one.
• Physical libraries are one potential victim.
 drawbacks: professional upheaval
• Digital librarianship is as yet, largely
• This leads me to the next topic.
           digital librarianship
• Librarianship has always been a bicephal
• Libraries always have a collection and service
  aspect to them.
• Digital libraries are no different.
              collection aspect
• The collection has to be managed and
• The organizers deal with dead matter,
• This organization is a scientific activity.
• Librarianship is a natural science.
• The librarian is a cataloger in a corner.
              service aspect
• Users have to be shown how the library works.
• Librarians have to understand users’ needs to
  build services users want.
• All these are social activities.
• Librarianship is a social science.
• The librarian is a people service person.
digital information was hard to use
• Computers had to be driven by esoteric
• Screens were hard to read from.
• Telephone lines were hard to get to work to
  transmit information.
• Access costs to digital information were high.
• The service aspect was important.
digital information is becoming easier
• Computers are easier and easier to use.
• Digital information providers tend to
  communicate directly with customers,
  bypassing libraries.
• Subject literacy becomes relatively more
  important than information literacy.
• The service aspect is being reduced over time.
          an important caveat
• Most items in the modern (19th, 20th century)
  library are mass-produced.
• There is no mass production or mass storage
  in the digital library.
• The difference between publishers, archives
  and libraries becomes very blurred.
     a course on digital libraries?
• My initial thought is that a course on digital
  libraries is nonsensical.
• In the near future, all libraries will be digital.
         digital libraries course
• Literacy and use of digital media.
• The idea is to look at what digital libraries
  exist and how to use them.
• This is really already done in LIS511.
• The course has the “building” theme to it.
               building aspect
• Building a digital library can basically take
  three forms
  – electronic resource management
  – repository building
  – cross-repository services
 electronic resource management
• Libraries license digital contents from
  providers and make them available.
• There are some minor technical issues
  – authentication
  – integration with ILS
• legal issues with the licensing
• minor training issues with users
            repository building
• Libraries are building repositories of local
  digital or digitized contents.
• This is firmly on the technical side.
• It is the main focus of the LIS654 course as it
  has been developed in the past.
• We cover digitization as part of repository
  building.
       cross-repository services
• I think of repositories as publishers, rather
  than libraries.
• Digital libraries are cross-repository datasets
  and services attached to them.
• This is where I have done almost all my work.
• It can not be done without custom computer
  programming.
              course syllabus
• It draws on Brian Hoffman’s syllabus for his
  Manhattan, Spring 2011 section.
• It is quite non-technical. I will tune up the
  technology over the years.
• One can argue that without computer
  programming, one can not be a digital
  librarian.
• But most digital libraries fail because of non-
  technical issues.
               my expertise
• My main expertise is in setting up completely
  new open-access digital library services and
• In non-technical terms, I can discuss how to
  set up these services and how they run.
• But I am reluctant to appear like a self-
  promoting pompous git.
        the wider environment
• Since 2008 I have been trying to build a
  special digital information concentration in
  the Palmer MSLIS program.
• The current version is at
• The LIS654 course is not part of the proposed
  concentration.
  second part: repository planning
• This slide set follows Reese and Banerjee very
  closely.
• We want to get through the gist of what they
  have in chapters one and two. I skip the most
  trivial things as well as the stuff that will be
  covered in copyright and imaging lectures.
• I have not been involved in repositories but I
  don’t buy a lot of what they write.
• “The ultimate success or failure of a digital
  repository is usually determined in the
  planning stage. A repository must be
  structured and organized that users can
  readily find and use diverse types of
  resources. It must be easy to maintain and
  capable of accommodating needs and
  resources that may not exist at the time the
  repository is designed.”
• Happy talk!
          planning importance
• “The ultimate success or failure of a digital
  repository is usually determined in the
  planning stage.”
• It would be useful to have an example of a
  repository that failed because it was badly
  planned.
• The weak contents in many academic
  repositories suggest that all are badly
  planned.
• “A repository must be structured and
  organized that users can readily find and use
  diverse types of resources.”
• Users don’t search local repositories. They
  come in through search engines or
  aggregators (which are also found through
  search engines). Optimizing repositories for
  local findability is plain wrong.
• “capable of accommodating needs and
  resources that may not exist”
• It is impossible to do that. Making this sort of
  idea a precondition for building a repository
  slows down progress with the real task.
           parallel to physical
• “Creating and managing a digital repository is
  similar to starting a new physical collection …
  new materials must be added while those that
  no longer support the mission of the
  repository should be removed”.
• The first idea holds people hostage to the past
  and the second is inimical to digital
• “one of the primary functions of digital
  repositories is to preserve electronic
  resources, though they must also provide a
  system for cataloging, indexing and retrieving
  digital materials”.
• We are still on page one, but already have a
  contradiction with the statement on the previous slide.
• “electronic resources” vs “digital materials”.
               missing here
• There needs to be an analysis done of the
  functionalities of the repository.
• Some of the aims of the repository may be
• Then a prioritization can take place between
  these different functionalities.
• This will allow to select an appropriate
“decision to build a digital repository”
• “Although many people treat repositories as
  short-term projects that can be funded with
  grants and other non-recurring monies, the
  reality is …”
• Building the repository will cost a lot.
• Maintaining it is ok, if you have somebody on
  staff who has minimum system administration
  skills and you can pay for external hosting and
  local backup.
• Comparing the repository to a new physical
  collection is not helpful.
          role of the repository
• “The importance of physically processing
  resources is diminishing, and more value is
  placed on the ability to locate and download
  remotely stored resources. In this sense,
  digital repositories are a logical outgrowth of
  traditional library services in response to
  challenges brought by network technology.”
• discuss ;-)
example 1 (born digital) offered by RB
• An example they point out is
• This is a well-presented collection.
• It seems to carry over coding mistakes from the
• There does not appear to be a harvesting
                 example 2
• Locally to them, they look at
• This is a ContentDM based digital image
  collection.
• This really is an archival collection.
          opportunity for libraries
•   Provide desktop access.
•   Present the library as au-fait with technology.
•   It is an occasion to set up skills.
•   Expand the remit of the library to publication
    of locally produced materials. This latter point
    mainly applies to academic institutions but
    may apply to others.
     problems with repositories
• Tools are not stable.
• Migrations will be required.
• User expectations are high (erh…)
• Electronic resources are more difficult to work
  with.
• Staff adaptability or having enough competent
  staff is the biggest challenge.
    repository purpose questions
• What type of resources will it contain?
• How big is it supposed to grow?
• Who is going to use it and how?
• How can resources be protected against
• How will access and IP rights be managed?
• What systems will it need to interact with?
• What resources will be available to create and
  maintain it?
   expected use of the repository
• R&B say that you have to form expectations
  about the use of the repository.
• What you, in principle, need to think about is
  how you organize searching and browsing.
• However in practice it turns out that you will
  only be able to do what the repository
  software will be able to do, unless you can
  change the software. Changing software can
  be a tall order.
• You usually have resources and their
  descriptions.
• The descriptions can be stored as BLOBs in a
  database.
• You need to extract the searchable data from the
  descriptions to make it searchable in the
  database.
• Example: find pictures shot between 2011-04
  and 2011-05.
• This is tougher.
• Here the data has to be discrete.
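A minimal sketch of why the data has to be discrete: once the shooting date is stored as a date value rather than as free text inside a description, the range query becomes a plain comparison. The records, file names and dates here are made up for illustration.

```python
from datetime import date

# Hypothetical picture records; the file names and dates are made up.
pictures = [
    {"file": "a.jpg", "shot": date(2011, 3, 30)},
    {"file": "b.jpg", "shot": date(2011, 4, 2)},
    {"file": "c.jpg", "shot": date(2011, 5, 31)},
    {"file": "d.jpg", "shot": date(2011, 6, 1)},
]


def shot_between(records, first, last):
    """Return the records whose shooting date lies between first and last, inclusive."""
    return [r for r in records if first <= r["shot"] <= last]


# From the first day of 2011-04 to the last day of 2011-05.
found = shot_between(pictures, date(2011, 4, 1), date(2011, 5, 31))
print([r["file"] for r in found])
```

A date written as text inside a description ("shot in spring 2011") could not be compared this way without first being converted.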
• Many times the same entity is referred to by
  different values, e.g. “Thomas Krichel” vs
  “Томас Крихель”, “The Magic Flute” vs “Die
  Zauberflöte”.
• If you want to have browsing by author,
  composer, work etc., you have to, most likely
  manually, bring variant forms together.
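One way to picture bringing variants together is a hand-maintained table that maps each variant to one preferred form. This is only a sketch: the table, the choice of preferred forms, and the German opera title are my own assumptions for illustration, not something a repository package is known to provide.

```python
# Hypothetical authority table, maintained by hand:
# each variant form points to one preferred form.
PREFERRED = {
    "Thomas Krichel": "Thomas Krichel",
    "Томас Крихель": "Thomas Krichel",
    "The Magic Flute": "The Magic Flute",
    "Die Zauberflöte": "The Magic Flute",
}


def browse_index(values):
    """Group raw values under their preferred form; unknown values are kept as they are."""
    index = {}
    for value in values:
        key = PREFERRED.get(value, value)
        index.setdefault(key, []).append(value)
    return index


index = browse_index(["Thomas Krichel", "Томас Крихель", "Die Zauberflöte"])
print(sorted(index))
```

The code is trivial. The manual work is in building and maintaining the table.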
• Since paper publishing is expensive, publishers
  have to exert some quality control.
• For physical collections, libraries have
  elaborate procedures. They have been
  evolving slowly for about 500 years.
• Libraries have catalogs, approval plans etc.
• These are of little help with digital materials.
• Most of the challenges of acquiring physical
  assets continue for digital assets, as R&B noted earlier.
• Another cerebral fart of R&B: “The value of a
  digital collection is measured by how well it
  helps people find what they need rather than
  by the number of items it contains.”
• They continue straight: “This means that to be
  useful, digital files must be selected and
  processed before they are stored.”
      developing the collection
• Putting resources into the repository
  because they are there?
• Rely on content providers to provide them?
• Rely on serendipity of library staff?
       R&B questions to answer
• What resources are desired and where are
• How will different versions of a document be
• Who should be involved in the selection
• What tools exist to help automatically detect
         fragmented resources
• “Acquiring resources for a digital repository is
  an inherently complex endeavor because it is
  often unclear what needs to be archived in the
  first place. Electronic resources frequently lack
  obvious boundaries.”
  – web pages
  – dynamically generated resources
           dealing with them
• R&B suggest
  – not including them?
  – reformatting them?
  – postponing dealing with them?
  – contracting out?
• The Internet Archive’s Heritrix is software
  that can deal with the archiving of web pages.
• The reformatting of links in proprietary file
  formats may be more difficult.
        identification planning
• This is an important process of building
• Anything that is considered a resource has to
  be given an identifier.
• Identifiers can be dumb or intelligent.
• Identification may be hierarchical and it can
  then be delegated.
• [I am leaving R&B here.]
              dumb identifiers
• Dumb identifiers contain no information
  about the item that they identify.
• For example a number can be used.
• Advantages
  – easy to create
  – no temptation to change
• Problem
  – not easy to relate to resource
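As a sketch of the simplest dumb identifier scheme, a running number can be handed out to each new item. The class name and the item descriptions below are hypothetical.

```python
import itertools


class DumbIdentifiers:
    """Hand out consecutive numbers that carry no information about the item."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self.assigned = {}  # identifier -> item description

    def assign(self, item):
        identifier = next(self._counter)
        self.assigned[identifier] = item
        return identifier


ids = DumbIdentifiers()
first = ids.assign("a working paper")
second = ids.assign("a photograph")
print(first, second)
```

The number says nothing about the item, so nothing about the item can ever give a reason to change it. The price is the lookup table needed to get from the number back to the resource.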
          intelligent identifiers
• They say something about the resource.
• Usually, any hierarchical identification
  structure has some intelligence built into it.
• But there is a temptation to change the
  handle when there is a change in the
  intelligent matter that the handle is built on.
          Example from RePEc
• The identification strategy was set by yours
  truly.
• It combines a centrally assigned archive code,
  a series code assigned by the archive, and a
  code of the paper in the series.
• This is problematic when series move
  between archives.
• I later tried to have the series code
  centrally assigned.
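A rough sketch of a three-part hierarchical identifier and of what happens when a series moves. The colon separator and all the codes are assumptions made up for illustration; they are not real RePEc data.

```python
def make_handle(archive, series, paper, sep=":"):
    """Join archive, series and paper codes into one hierarchical identifier."""
    return sep.join([archive, series, paper])


# Hypothetical codes for one paper.
before = make_handle("aaa", "wpaper", "0042")
# The same paper after its series has moved to another archive.
after = make_handle("bbb", "wpaper", "0042")

print(before, after, before == after)
```

The paper has not changed, but its identifier has, because the archive code is part of it.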
    problem of handle instability
• If handles change, there are problems with all
  services based on them.
• For example if you have an announcement
  service, the paper appears to be new.
• If you have an author claiming service, the
  author appears to lose a paper and has to
  select the paper again.
                third part: ssh
•   a bit about the computer
•   the Internet
•   ssh and our host
•   a brief discussion of the operating system of
    the host
   a few words about a computer
• To use a computer you need to know
  something about its operating system (o/s).
• The o/s sets out how the computer behaves.
• There are two o/s flavors that are widely used
  – MS Windows (in various versions)
  – UNIX-like operating systems
• Our server runs an o/s called “Debian GNU/Linux”.
  some generalities about Debian
• Debian is an open-source computer operating
  system developed and maintained by a large
  group of volunteers.
• Debian packages together a very large set of
  pieces of software into a coherent system.
• It provides a version of the UNIX operating
  system using Linux.
• The following notes hold for all (?) Unix flavor
  operating systems.
• Files are continuous chunks of data on disks
  that are required for software applications.
• Files have names.
• Files have permissions attached to them,
  discussed in "permissions model".
• Files have times attached to them. Usually
  the mtime (time of contents modification) is
  the only one shown.
• Directories are files that contain other files.
  Microsoft calls them folders.
• They have names, permissions and times like
  other files.
• In UNIX, the directory separator is “/”
• The top directory is “/” on its own.
• Links are files that contain the address of
  other files.
• In MS Windows, links are called shortcuts.
• The times and permissions of links are kept
  but they are of no importance.
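To make the three kinds of files concrete, here is a small sketch that creates a directory, a file and a link in a temporary place and then asks the operating system what each one is. It assumes a Unix-like system. The names docs, note.txt and shortcut are made up.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as top:
    # A directory containing one ordinary file.
    directory = os.path.join(top, "docs")
    os.mkdir(directory)
    target = os.path.join(directory, "note.txt")
    with open(target, "w") as handle:
        handle.write("hello\n")

    # A link holding the address of the file.
    link = os.path.join(top, "shortcut")
    os.symlink(target, link)

    kinds = {
        "docs": os.path.isdir(directory),
        "note.txt": os.path.isfile(target),
        "shortcut": os.path.islink(link),
    }
    link_target = os.readlink(link)
    # The mtime is the time of the last contents modification.
    mtime = time.ctime(os.path.getmtime(target))
    print(kinds, mtime)
```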
            users and groups
• “root” is the user name of the superuser.
• The superuser has all privileges.
• There are other physical users, i.e. persons
  using the machine
• There are users that are virtual, usually
  created to run a daemon. For example, the
  web server is run by a user www-data.
• Arbitrary users can be put together in
  groups.
              permission model
• Permissions on files are given
  – to the owner of the file
  – to the group of the file
  – and to the rest of the world
• A group is a grouping of users. Unix allows you to
  define any number of groups and make users
  members of them.
• The rest of the world are all other users who have
  access to the system. That includes www-data!
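The three levels can be read off the numeric mode of a file. The sketch below uses Python's standard stat module to split a mode into the rights of the owner, the group and the rest of the world. The mode 0o644 is just an example value, and the function name is my own.

```python
import stat


def describe(mode):
    """Split a numeric mode into the rights of owner, group and rest of the world."""
    def rights(read, write, execute):
        return {
            "read": bool(mode & read),
            "write": bool(mode & write),
            "execute": bool(mode & execute),
        }
    return {
        "owner": rights(stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
        "group": rights(stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
        "world": rights(stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
    }


# 0o644: the owner may read and write, group and world may only read.
mode = 0o644
summary = describe(mode)
# The same information in the form a directory listing shows it.
print(stat.filemode(stat.S_IFREG | mode))
```

A file with this mode in a web directory can be read by www-data, since www-data falls under the rest of the world.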
          user name & password
• To work with our server, you need a user name
  and a password.
• You can choose your user name as a short form
  of your own name.
• It should be all lowercase and can not have
• Please don’t choose an insecure password.
               the Internet
• The Internet is an interconnected set of
  physically disparate networks.
• Each computer, when connected to the
  Internet, has an IP address. It's a four-byte
  number written as four decimal numbers from
  0 to 255 connected by dots. Example:
• Once a computer has an address, it can
  communicate with others using a protocol
  known as IP.
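To see the four bytes behind the dotted notation, Python's standard ipaddress module can take an address apart. The address 192.0.2.1 is only a stand-in from a range reserved for documentation. It is not the address of any machine used in this course.

```python
import ipaddress

# 192.0.2.1 comes from a range reserved for documentation; it is a stand-in.
address = ipaddress.IPv4Address("192.0.2.1")

raw = address.packed        # the address as four bytes
parts = list(raw)           # each byte as a number from 0 to 255
as_integer = int(address)   # the same four bytes read as one number

print(len(raw), parts, as_integer)
```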
    Internet application protocols
• Once we have the Internet, we need protocols
  to work with it.
• They are called Internet application protocols.
• Their king is the domain name system.
• Two other protocols we will work with are http
  and ssh.
            Domain Name System
• The Domain Name System allows us to associate human-
  friendly names with IP addresses. These names are
  called domain names.
• Domain names can be leased from domain name
  registrars.
• A machine with a domain name on the Internet is
  called a host.
• When we know the domain name of the host, we can
  communicate with the host.
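The step from a name to an address can be tried out with the resolver that comes with the operating system. This is a minimal sketch using Python's standard socket module. The function name lookup is my own.

```python
import socket


def lookup(name):
    """Ask the system resolver for an IPv4 address belonging to the given host name."""
    return socket.gethostbyname(name)
```

Calling lookup("localhost") asks for the address of the machine itself. With a working network connection, any registered domain name can be passed instead. A string that is already a dotted address is returned unchanged.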
 protocols to communicate with hosts
• There are two protocols we use in this class.
  – We use http to work with the omeka web
  – We use ssh for some special operations.
• Both protocols are client/server protocols.
• You run an ssh or http client on your local
  machine.
• You communicate with a machine that runs
  ssh or http server software.
               the ssh protocol
• ssh is a protocol that uses public key
  cryptography to encrypt a stream of
  communication between client and server.
• This allows us to privately manipulate the
  server. Our “manipulations” are really just
  changes to files on the server that contain our
  web pages.
• The ssh client software we use on the PC is
  called WinSCP. It is a file transfer program.
                  our server
• Is the machine
• We also say it is a “host” on the Internet.
• wotan is the head of the gods in the Germanic
  legend. The name has nothing to do with
  Chinese food.
• It is a humble PC.
• It runs the testing version of Debian GNU/Linux.
• It runs both http and ssh server software.
• It is maintained by Thomas Krichel.
                  the web site
• As part of the course, you are being provided
  with web space on the server, at
  the URL
  where user is a user name that you have chosen.
• This shows a list of available files as prepared by
  the web server at wotan.
• This is a page that Thomas has prepared for you.
                 ssh protocol
• The ssh protocol implements a secure
  connection to the server over which we can
  – send instructions to it
  – store files on it.
• wotan runs an ssh server.
• On your machines, you run ssh client
  software.
            ssh client software
• On MS Windows machines, we run
  – putty for interactive use
  – WinSCP for file storage and retrieval.
• Usually, students in this class only need to
  understand WinSCP.
• On the Mac, you can use
  – Cyberduck
  – Fugu
• For interactive use on the Mac use Terminal.
• In winscp, the client that we use here most of
  the time, we don't make advanced use of public
  keys, we simply give a password.
• Note that winscp does not open an interactive
  session on wotan. It simply uses ssh as a
  means to transfer files.
• When winscp saves a file, it may need to open
  a new connection and will ask you for the password
  again. This request may be in a window you
  can't immediately see.
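For those who prefer a terminal, the same two activities (interactive use and file transfer) map onto the ssh and scp command-line clients, where these are installed. The sketch below only builds the command lines and does not run them. The user name jdoe and the host name host.example.org are hypothetical placeholders, not the course server.

```python
def ssh_command(user, host):
    """Arguments for opening an interactive session on the server."""
    return ["ssh", f"{user}@{host}"]


def scp_command(local_file, user, host, remote_path):
    """Arguments for copying one local file to a path on the server."""
    return ["scp", local_file, f"{user}@{host}:{remote_path}"]


# Hypothetical user and host names.
interactive = ssh_command("jdoe", "host.example.org")
upload = scp_command("index.html", "jdoe", "host.example.org", "public_html/")
print(" ".join(interactive))
print(" ".join(upload))
```

Either list could be handed to subprocess.run to start the client, which would then ask for the password just as winscp does.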
   open a wotan session with winscp
• If you see a list of sessions, click on “new session”.
   – The host name is “”.
   – Give your user name.
   – Click on “save”; the session is saved after you click “ok”.
• You will be led to the list of saved sessions; double-
  click to open a session.
• At initial connection, you will be shown a warning
  message that you can ignore.
• When saving or duplicating files, you may be asked to
  enter your password again. Watch out for that.
              home directory
• When you have connected to wotan and
  authenticated as a certain user, you will
  be shown your home directory.
• On wotan this is /home/user where user is
  your user name.
• There you see a bunch of files starting with a
  dot. Leave them alone.
• And you see a bunch of directories.
            initial files on wotan
• A directory called public_html. This is your web
  site. Everything you store there is on the web.
• A set of files starting with a dot. They are
  greyed out.
• One of them is called .my.cnf. This is an
  initialization file for your mySQL client. We will
  not use the client, but we will store the
  password there.
• The file should be readable and writable by you
  only, with no access for the group or other users.
• mySQL is an implementation of relational
  database software. More about it later.
• It uses its own permission system. That means
  that it has a separate user/password space.
• By Thomas’ decision, your mySQL username is
  the same as your wotan user name. But
  Thomas does not know how to import your
  wotan password as your mySQL password. It
  has to be recorded separately.
• We use .my.cnf in your home directory.
         initial state of .my.cnf
1. # on line 4, replace your_password with your
   chosen mysql password
2. [client]
3. user = your_user_name
4. password = your_password
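A file of this shape can be checked with a few lines of Python. The sketch writes a stand-in .my.cnf into a temporary directory, restricts it to its owner, reads the [client] section back, and tests whether the placeholder password has been replaced. The user name and password are made up, and the check is my own illustration, not something the course setup runs.

```python
import configparser
import os
import stat
import tempfile

# Stand-in contents; the user name and password are made up.
CONTENTS = "# comment line\n[client]\nuser = jdoe\npassword = not-a-real-password\n"

with tempfile.TemporaryDirectory() as home:
    path = os.path.join(home, ".my.cnf")
    with open(path, "w") as handle:
        handle.write(CONTENTS)
    # Readable and writable by the owner only.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

    # interpolation=None keeps a "%" in a password from being misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    user = parser["client"]["user"]
    password_is_set = parser["client"]["password"] != "your_password"
    mode = stat.S_IMODE(os.stat(path).st_mode)
    print(user, password_is_set, oct(mode))
```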
          web home directory
• The web home directory is /var/www.
• There you see a directory home, with a series
  of links
  – they have a user name as file name
  – they go to your home/public_html directory
• There you see a directory omeka with a series
  of links
  – they have a user name as file name
  – they go to your omeka directory
             web site address
• goes to the /var/www
  directory. There it shows the file index.html.
• goes to the
  /var/www/home directory, where it finds the
  link to the public_html directory of the user
• In that directory, it will show the file
  index.html if it exists. Otherwise, it will build
  an index on the fly.
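The last rule (show index.html if it exists, otherwise build an index) can be sketched in a few lines. This only imitates what the web server does. The function name is my own, and a real server produces an HTML page rather than a plain list of names.

```python
import os
import tempfile


def serve_directory(directory):
    """Return the contents of index.html if present, otherwise a listing built on the fly."""
    index = os.path.join(directory, "index.html")
    if os.path.isfile(index):
        with open(index) as handle:
            return handle.read()
    return "\n".join(sorted(os.listdir(directory)))


with tempfile.TemporaryDirectory() as public_html:
    # Only one file, no index.html: a listing is built.
    with open(os.path.join(public_html, "paper.pdf"), "w") as handle:
        handle.write("")
    listing = serve_directory(public_html)

    # After index.html is added, it is shown instead.
    with open(os.path.join(public_html, "index.html"), "w") as handle:
        handle.write("<p>welcome</p>")
    page = serve_directory(public_html)

print(listing)
print(page)
```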
• This is more of a technical issue.
• You will need backup. My general prescription
  would be to run the repository itself with a 3rd
  party provider.
• Locally, keep a staging (rather than
  production) server and a backup. They can
  both be on the same machine.
• All this should be part of the sysadmin course.
  common-sensical sysadmin tips
• You need physical security for any server.
• You need to keep the software up-to-date. I
  do it, roughly, weekly.
• You need to join the mailing list for the
  repository software, and the security list for
  the operating system.
• Encrypted access to the server when
  authentication is required.
• Run minimal amount of software.

  Please shut down the computers when
              you are done.

     Thank you for your attention!
