

ISSA – The Global Voice of Information Security | ISSA Journal | December 2007

un·struc·tured da·ta: Computerized information that does not have a data structure (i.e., not within a database)

Managing Unstructured Data
By Johnnie Konstantas

Year after year businesses, organizations and enterprises create an endless array of data
files, some extremely sensitive, some completely mundane. Managing this unstructured data
can be formidable, but the steps discussed here can help make the process manageable.

Year after year businesses, organizations and enterprises create an endless array of data files, some extremely sensitive, some completely mundane. In the form of documents, images, spreadsheets, email messages, presentations, multi-media files, etc., this unstructured data is stored throughout IT systems on file servers. Some files are accessed daily, some annually, some completely forgotten, and managing all this data can become formidable.

Efforts to tame this unstructured file share data abound. IT departments are tasked with managing all this data, keeping it available to those who need it, keeping it away from those who would abuse it, and keeping it within regulatory and compliance requirements:

    • Risk management and loss prevention – Classifying information in order to identify content that is sensitive
    • Data entitlement management – Document and enforce a scalable and repeatable process for determining who gets access to what data
    • Hierarchical storage management and data migration – Moving stale or outdated data from file servers to cheaper near-line or offline storage solutions

There are even initiatives to impose structure on unstructured data by migrating it to document management systems like Microsoft’s SharePoint or EMC’s Documentum.

Most enterprises undertaking projects to manage their unstructured data have discovered their efforts to be extremely slow and costly in terms of time and resources required. Although technologies for classifying, protecting and moving data exist, successful implementation may require months, if not years, to accomplish, depending on the amount of unstructured data.

Data indexing – a technical procedure that conducts a deep examination of file server contents for the purpose of creating a logical ordering of the unstructured data therein.

Before an organization can even begin to manage all its unstructured data, it must know what it has: various products for data classification, eDiscovery, loss prevention and content management require that file share contents be indexed as a first step. In order to get perspective on the process, consider the time it takes to “crawl,” or index, data. A typical medium-sized enterprise has about 10 terabytes of data (10,000 gigabytes). Indexing takes about two hours per gigabyte – about 2.3 years for the medium-sized company! And unstructured data grows by more than 50% annually.

Naturally, given the length of time to complete, most enterprises understand that data classification, migration, and protection projects have to be rolled out in phases. The challenge is knowing where to start. Because file share data is unstructured, there is no way to discern which is important and which is not. Think of all the files on your personal desktop computer that you have been saving and ignoring for years, and how much time and effort would be involved in viewing each file to decide whether to keep it. Given enough storage space, you will probably save it for another day.

There is a way, however, to increase the efficiency and accuracy of these efforts and shave off months or even years in implementation. Outlined below are ten key requirements for any effort, system or technology whose purpose is to classify, migrate, protect or otherwise manage business data – specifically the documents, presentations, spreadsheets, scanned images, multi-media files, etc., that fill file servers and form any enterprise’s valued assets. The key is to begin by making the ten activities listed below prerequisite to any project for unstructured data management and protection.

    1. Create an inventory of file share contents
    2. Remove overly permissive access permissions
    3. Remove global groups
    4. Remove unused accounts


    5. Identify orphan data
    6. Identify stale data
    7. Identify infrequently accessed data
    8. Identify highly active data
    9. Identify business owners
    10. Repeat above activities monthly to accommodate change

The discussion that follows elaborates on these required activities and explains their benefit in expediting and scaling data management.

1. Inventory file share contents
Most enterprises do not know exactly what is in their file systems. Each user in an organization has his own partition or space on the file share and generally uses it at his discretion. Access permissions for this data are also hard to track. A folder may be available in a limited fashion to a handful of users, but as needs for the data therein increase, permissions are changed to include many groups or the whole company. The net effect is that the IT group and file server administrators do not have an accurate picture of what is contained on the file servers. Step one, then, is to create an inventory of the data and its permissions, as well as the list of persons with access. This will help guide the next steps in protecting and managing it.

The inventory of the file server contents must include:
    • All users, including their group memberships, Active Directory attributes and data permissions
    • All folders and subfolders within a file server, as well as the Microsoft NTFS permissions to each folder for any user or user group who is part of the domain
    • Filtered views that allow queries based on user name, group name or folder/data name
    • Automated updating of views to reflect changes or new data within Active Directory (i.e., user-to-group membership) as well as within the file server (i.e., new data, deleted data, renamed data)

2. Remove overly permissive access
Most enterprises have very efficient processes for granting access permissions to data, but few revoke those permissions when the need has passed. As a result, most access to file share data is unwarranted and the permissions are dated.

In part to address this, and in part to achieve the broader goal of stemming the dissemination of critical information to unauthorized persons or to those outside the company, some enterprises undertake data loss prevention (DLP) projects. These projects normally start with technology that looks to create an index of unstructured data and subsequently to classify it for the purpose of identifying sensitive and valuable information. The challenge with this approach, however, is that because indexing and classification take so long, the risk of data loss is incurred every day that the project is in its implementation phase.

The solution is to start with a broad reduction of access privileges so as to limit access to only the users who have a business need for the data. This step dramatically reduces the probability of data loss and can be conducted in a fraction of the time it takes to index and classify information. By revoking permissions as a first step prior to a DLP project, enterprises can reduce the exposure in the interim time frame while the DLP project is being scoped and rolled out. The process for revoking permissions to data should be automated and include:
    • Identifying the names of those persons who no longer need access
    • Identifying the data sets to which those persons no longer need access
    • Centralizing dissemination of the permissions revocation to the live environment
    • Recording permissions pre- and post-revocation as part of a change report

3. Remove global groups
Data access “permission creep” is quite common. In fact, nearly 100% of organizations can identify some files or folders where permissions for access are overly liberal. As file share contents grow and individuals change roles, their business needs for data grow. As a result, IT operations personnel are forced to open file access controls for broader and broader data availability. In some cases, global groups are assigned. These are groups that, in Microsoft Active Directory nomenclature, include a very large percentage of the organization’s user population. The Everyone group is one such designation. As its name suggests, when the Everyone group is assigned to a folder, it makes the data within available to everyone. With the right solutions and technology in place, removing global group access permissions and replacing them with more granular access controls can be done fairly quickly, and serves to dramatically reduce the probability of data loss by restricting access to business need-to-know. Any project for data loss prevention will benefit from global group removal as a first step. The process should include:
    • Identifying folders with global group assignments
    • Identifying individuals who require access to those folders that have global group assignments
    • Removing global groups
    • Assigning individual permissions
    • Recording the revocations

4. Remove unused user accounts
As with unwanted data permissions, enterprises often find it difficult to keep track of accounts within user repositories. As individuals leave a company, change roles or move within and


across organizations, their account types need to be changed or revoked. This updating, although clearly useful, does not take place as a matter of course in most enterprises. It is, however, a fairly simple and quick way to reduce the risk of exposure to data loss by removing unnecessary access privileges. This activity also increases the accuracy of data-use monitoring by ensuring that accounts are not co-opted for use by persons other than their rightful users. Removing unused user accounts from Active Directory should include:
    • Identifying inactive accounts
    • Verifying systematically that they are not in use
    • Conducting the revocation from a central location

5. Identify orphan data
Projects to migrate data to different tiers of storage (i.e., in-line, near-line, off-line) necessitate that information be segregated into that which must be readily available versus data which is of less critical importance and can be archived. Typically, such projects can take a very long time to complete, especially if data indexing and classification are the approaches applied to the task. Data migration can be made less arduous and much faster by analyzing how data is used and identifying the data that has no known owner (i.e., has not been accessed by anyone). File share data without an owner is considered orphaned and is a very likely candidate for offline storage. Completing this task should not require classification or indexing and should, in fact, be a precursor to both. Orphaned data, once identified, can be examined for content type at a later date. However, because it is not in active use, the urgency of this task decreases dramatically.

6. Identify stale data
As with orphan data, identification of data that has not been accessed for several months to a year is a very efficient way to group data sets as good candidates for offline storage or deletion. This requirement can be completed fairly quickly and should not require use of indexing or classification technologies.

7. Identify infrequently accessed data
As with orphaned and stale data, some file server contents are so infrequently accessed that they may be good candidates for near-line storage. It is important to understand your organization’s data requirements and what good metrics are for defining frequency of access. Completing this task, however, need not take a long time. Again, the approach used does not need to make use of classification or indexing. Rather, infrequently accessed data can be identified with a data-use audit. The most effective data-use audits include the ability to sort the information by time interval, event type (i.e., open, delete, rename, create), or frequency count, for example.

8. Identify highly active data
Constituent to the requirements outlined above is the ability to identify highly active data. Applying the principle used for infrequently accessed data in reverse, the data access audit should allow sorting by a specified time period and event count in order to quickly zero in on the most actively accessed files and folders on a share or shares. This most actively accessed data is implicitly of high business importance and is a good candidate not only for in-line storage, but also for indexing, classification and loss prevention projects. By completing the identification of the most active data sets prior to any such projects, IT operations personnel can shorten the project length, focusing their efforts on the most important business information first.

9. Identify data business owners
For all of the requirements of proper data management discussed, it is important to note that having a list of the data business owners can markedly increase the accuracy and efficiency of each requirement. By consulting with business owners, the administrators of unstructured information can ensure that permissions revocations, data migrations and access controls are commensurate with business needs and company policies. IT operations personnel should be able to generate a list of data business owners for any given data set at will. Business owner identification should be capable of being completed “on demand,” given that this information spans many projects and needs.

10. Repeat steps for scale and change
Since unstructured data is not only the most voluminous but also the fastest growing data type within organizations, it is important that enterprises set up processes by which these requirements for data management can be applied and followed at regular intervals with consistent results.

Conclusion
Getting control of unstructured data is an imperative for companies, but IT operations personnel have been challenged on how to get started. As noted, there are products in the marketplace that focus on the business content and context of data. However, their implementation is very complex and, as a result, lengthy and costly. The requirements outlined above enable businesses to realize the benefits of these products in a pragmatic way. Meeting them as a first step ensures that data security is addressed and that resources are applied with the greatest business impact.

About the Author
Johnnie Konstantas, vice president of marketing for Varonis, has more than 14 years’ experience in the network-security and telecommunications fields, having held various senior-level roles in marketing, product management and engineering. She may be reached at
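The recency-based grouping described in steps 5 through 8 can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor’s implementation: the thresholds are invented for the example, and it relies on filesystem last-access times, which many file servers do not record reliably; in practice a data-use audit log is the more dependable source.

```python
import os
import time

# Illustrative thresholds; real values should come from the
# organization's data-retention and storage policies.
STALE_DAYS = 365        # candidate for offline storage or deletion
INFREQUENT_DAYS = 90    # candidate for near-line storage

def bucket_by_recency(root, now=None):
    """Group files under `root` by last-access recency.

    Uses the filesystem's atime; servers mounted with `noatime`
    would need an access-audit log instead.
    """
    now = now or time.time()
    buckets = {"active": [], "infrequent": [], "stale": []}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                age_days = (now - os.stat(path).st_atime) / 86400
            except OSError:
                continue  # unreadable entry; skip rather than fail
            if age_days >= STALE_DAYS:
                buckets["stale"].append(path)
            elif age_days >= INFREQUENT_DAYS:
                buckets["infrequent"].append(path)
            else:
                buckets["active"].append(path)
    return buckets
```

The same bucketing logic could equally be driven by event counts from a data-use audit, which would also support sorting by event type (open, delete, rename, create) as described in step 7.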

