OSG Security Roadmap
Mine Altunay, OSG Security Officer
It is essential that we improve the security hygiene of the operating infrastructure in the short
term. In parallel we can think about and work on the design and principles of an infrastructure that
meets the longer-term needs of a) dynamic, ad-hoc VOs, b) removal of constraints due to
legacy technologies, c) a more diverse mix of resources and applications, including
commercial resources and licensed applications, and d) authority and control appropriately divested
and distributed to the sites and VOs.
This note therefore lists the activities I recommend we carry out over the next year or two to contain
the risk and provide appropriate defense and response to security issues.
In particular, while the process of proxy delegation and renewal may be broken in principle, in the
time frame of a few years this will remain as the operating infrastructure and we must fix the holes
and software to operate on a more secure technical basis.
OSG Security Roadmap
Short term security needs and goals (6 months to one year)
1. Authentication: Proxy Cleanup:
2. Authentication: GSI CRL problem
3. Authentication: Distribution of CA and CRLs
4. Authentication: GUMS-VOMS Handshake
5. Authentication to VOMS Admin and VOMRS
6. Authentication: MyProxy and certificate renewal
7. Authorization: Banning tool
8. Authorization: Uniform FQANs across sites, and grids
Longer Term Needs related to both Authentication and Authorization:
9. Certificate and Proxy issues
10. Configuration management tools.
11. Validation of security software:
12. More complete error reporting:
Operational Security: Incident Response and Containment
13. Monitoring the Grid, preventing wide-spread of incidents
Policy Work
Short term security needs and goals (6 months to one year)
1. Authentication: Proxy Cleanup:
Why needed: Operational Hygiene and Incident Containment
After a job finishes, its proxies should be deleted. For shared group accounts and pilot job
accounts, this is an obvious requirement since the account is accessed by multiple users. However,
it is also a requirement for individual pool accounts and individual accounts (accounts not shared by
multiple users). When an incident happens on a machine, leftover proxies cause a big headache
and expose other sites that the proxy owner may have access to. In such cases, we revoke the user
certificate and try to notify the possibly affected sites. However, the attacker may delete the
user account or the proxies to cover his tracks. Moreover, site admins may not immediately be
aware of an ongoing attack, or may not know which users had used the local account (consider the
case where a site maps users into individual pool accounts based on FQANs: whoever got mapped
into the compromised account may have left their proxies there).
In the case of GRAM2, a proxy is delegated on each job submission. The proxy file is deleted once
the job is completed. Apparently, condor-g restarts the job manager, in which case the proxy file can
be left behind.
GRAM4 allows credential delegation to be optional, and a delegated credential can be shared
across multiple jobs. So it is up to the client to explicitly destroy the credential using the delegation
service interface. The GRAM4 clients are:
1. globusrun-ws: destroys credentials once job is completed
2. condor-g: uses one delegated credential across all jobs, with no explicit removal. Need to
discuss with Condor folks for details.
3. gridway: don’t think it is of relevance to OSG.
Need a technical discussion of options. Mine to arrange in next 6 weeks.
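As a starting point for that discussion, the cleanup requirement can be illustrated with a small site-side check for leftover proxy files. This is only a sketch: the function name `find_stale_proxies`, the age threshold, and the `x509up_*` file name pattern are assumptions for illustration, and delegated proxies may live in other locations depending on the job manager.

```python
import time
from pathlib import Path


def find_stale_proxies(directory, max_age_seconds, now=None):
    """Return proxy-like files not modified for more than max_age_seconds.

    The "x509up_*" pattern is an assumption; adjust it to wherever the
    local job manager actually stores delegated proxies.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(directory).glob("x509up_*"):
        # Only regular files count; age is measured from last modification.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            stale.append(path)
    return sorted(stale)
```

A site could run such a check from cron and either report or delete what it finds. It does not replace having the clients destroy their delegated credentials.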
[D.O. – So the requirement can be stated as the proxy should not be left longer than needed for the
work, and it should not be available to other users. So we can tell our software vendors we want
proxies deleted when the work is done, but work may be > 1 batch job. We can tell sites/VOs that if
different users (DNs) are mapped to the same UID there is a likely proxy sharing problem and
recommend to map to different UIDs, and let the sites/VOs decide and advertise themselves
accurately. I think many will want to map whole VO to same UID. We are not developing Condor or
Globus or other clients, except maybe Panda, so what is our action besides talking to people?]
[MA – no I am not developing any code. I would rather ask our vendors to check their code and make
changes. I have gathered the current situation from our vendors and will send a technical report on
2. GSI CRL Authentication problem
Why needed: Major Authentication Vulnerability
When a site has no CRLs associated with a CA, the GSI authentication does not fail. The
authentication MUST fail because without CRL, GSI cannot know if this is a revoked cert
or not. Although it is not GSI’s responsibility to download the latest CRL, it is the responsibility of GSI authN
to fail when the CRL is absent. [D.O. – We can state it as OSG policy that a registered resource must
use CRLs, and authN must fail if CRL does not exist; before service is opened to public. There can be
good testing/installation reasons for allowing a service/resource to run without CRLs during
A fear is that if GSI fails for each missing CRL, the sites would object because it would put a burden on
them. I think this is exactly the situation we have to create. Sites must bear the burden of keeping CRLs
and CAs up-to-date. PKI is not designed to work without CRLs and trying to do otherwise is a security
We can combine an automated download tool with GSI authentication failure. However, this may
cause many clients to try to download CRLs over the network at the same time. Another option is to set up
a cron job on each gatekeeper, set up a CA distribution service at GOC, and keep track of sites (who has
updated and who has not). This cron job must be set up in VDT by default and check with GOC service to
Globus has different plans: they are focusing on a pragmatic solution in which MyProxy
provisions users with the trustroot info as well as updated CRLs. Those CRLs will be copied into the
expected directories so that they are used for subsequent validation. In other words,
clients should periodically go to their home organization's MyProxy server, which will maintain an
up-to-date set of CAs and CRLs for all its clients to use. The hope is to extend this model to
servers as well, where servers would periodically pull updated trustroot info from MyProxy. The advantage
of this approach is that the distribution is centralized for the sites, and that it works with the existing
Since OSG does not have experience with MyProxy (it is in the VDT but we do not ask our sites to
use it), this might be troublesome for us.
Need a technical discussion of options. How about at next Facility or Security meeting?
[D.O. We can/should have an RSV probe to check the status of installed CAs/CRLs. We have the
vdt-update-certs tool to help keep clients & servers up to date. We cannot check if clients are up to
date. What is left to do?]
[MA: yes we do have two scripts: fetch-crl (to update CRLs) and vdt-cert-update (to update CA
directories). The problem is that when GSI fails due to a missing CRL, we must provide a tool to easily
update the missing CRLs and certs. I would like to build a service for a site, where each gatekeeper of
that site can go and update. The site can update in turn from the GOC cache. I am not happy with scripts
all over the place and site admins not knowing how to use them.]
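The fail-when-CRL-is-absent rule and the probe mentioned above can be sketched as a check over a trusted-certificates directory. This is an illustration, not GSI code: it assumes the common layout in which a CA certificate is stored as `<hash>.0` and its CRL as `<hash>.r0`, and the function names are hypothetical.

```python
from pathlib import Path


def cas_missing_crl(cert_dir):
    """List CA hashes that have a <hash>.0 certificate but no <hash>.r0 CRL."""
    missing = []
    for ca_file in Path(cert_dir).glob("*.0"):
        # "abc123.0" -> "abc123.r0"
        if not ca_file.with_suffix(".r0").is_file():
            missing.append(ca_file.stem)
    return sorted(missing)


def authentication_allowed(ca_hash, cert_dir):
    """Fail closed: allow only when both the CA file and its CRL are present.

    Assumed policy for illustration; a real check would also verify that
    the CRL is signed by the CA and has not passed its next-update time.
    """
    cert_dir = Path(cert_dir)
    return (cert_dir / f"{ca_hash}.0").is_file() and (cert_dir / f"{ca_hash}.r0").is_file()
```

The first function is the kind of test a probe could report on; the second shows the fail-closed decision asked of the authentication layer.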
3. Authentication: Distribution of CA and CRLs
Why needed: Major Authentication Failure
Our sites do not update their CA directories as often as needed. As explained above, GOC can
maintain a service for the sites so that they can get updates on CA files and CRLs. It would be
immensely useful if this service could use RSV probes to see which versions have been downloaded
by a site and send a warning to the site admin and other interested parties. I purposefully
say other interested parties because CRLs should be used not only by sites, but by any client, VO or
site services. #4 explains why VOM(R)S needs CRL checks.
As future work, we must understand which clients (including the user’s tools and browser) must
make use of CRLs. This is a heavier problem because client applications are numerous and users are
not very knowledgeable.
[D.O. – Same solution as #2.]
4. Authentication to VOMS Admin and VOMRS
Why Needed: Admin level compromise at VOMS server
Both VOMS Admin and VOMRS have a check-but-do-not-require CRL policy. In one of the
recent incidents, we realized that stolen certificates can be used to access either of them with
admin-level access. In other words, one of the revoked certificates belonged to a VOMRS admin, and an
attacker could have gained access to VOMRS by using this certificate.
Input as a requirement to the developers and check as the VOMRS/ VOMS-ADMIN possible merge
[D.O. -- Yes, all user authentication should be authenticated with trusted CAs and CRLs.]
5. VOMS-GUMS Authentication
Why needed: Major Authorization Problem (yes user authorization – not authentication)
When validating a user’s VO membership, GUMS must authenticate with the VOMS server.
The VOMS server tells GUMS, or whoever else is asking, whether that user is a member of the VO or
not. If GUMS does not authenticate with the VOMS server, it may
be exposed to man-in-the-middle-type attacks. Imagine an attacker building a malicious VOMS
server. Unless GUMS checks the certificate of the VOMS server, it will mistakenly provide access to
malicious users who are not members of the VO. In addition, there must be a secure (e.g. SSL-based)
connection between GUMS and VOMS so that the integrity of the information is preserved.
Currently there is no authentication between GUMS and VOMS. Anyone who can IP-spoof a known VOMS
server can mount an attack. As a result, the GUMS server would allow access to non-VO members.
The non-VO members (malicious users) must have a valid DOE certificate, but they can get hold of
stolen certificates after a prior security incident, or use revoked certs, or even worse, they can
be legitimate people with existing DOE certs who are not members of this particular VO. Since our sites
are known to update their CRLs late or not at all, this attack can very well work.
Request made to VO service project. No promise yet. Need to work to get commitment and use
existing OSG contribution from BNL if possible.
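To make the requirement concrete, the following sketch shows what "checking the certificate of the server" means for any client that contacts a membership service over SSL/TLS. It uses Python's standard `ssl` module purely as an illustration; it is not GUMS code, and the function name and the `ca_path` parameter are assumptions.

```python
import ssl


def strict_client_context(ca_path=None):
    """Build a client-side TLS context that refuses unauthenticated servers.

    ca_path is an assumed directory of trusted CA certificates; when it is
    None, the system default trust store is used.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, capath=ca_path)
    # Stated explicitly, although these are the defaults for this purpose:
    context.check_hostname = True            # server name must match its certificate
    context.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a valid chain
    # For mutual authentication the client would also present its own host
    # certificate, e.g. via context.load_cert_chain(certfile, keyfile).
    return context
```

With such a context, a spoofed server that cannot present a certificate chaining to a trusted CA fails the handshake before any membership data is exchanged.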
[D.O. – Isn’t this issue obsolete with the planned upgrade of VOMS/GUMS that relies on validating
the VOMS signature of the FQAN in the proxy certificate?]
[MA: I am writing another document on pros and cons of AC validation. Please wait until you see
6. Authentication: Shorter Proxy-lifetimes
Why needed: Operational Security, possible interoperability problem in future
In OSG we do not use MyProxy servers to renew expired proxies. Instead we allow very long-lived
proxies, with lifetimes such as 1 year. This causes a potential security problem since anyone who steals
the proxies can do anything with them on behalf of the original user. We of course do not have CRLs
for proxies; this is natural. There is work in EGEE to have gatekeepers reject long-lived proxies.
When this happens, our users would lose the ability to submit jobs. Moreover, proxies that expire at
OSG sites would not get renewed.
In addition to this, during an incident, proxies cause the biggest problem. As prevention, OSG sites
may want to reject any proxies with a too-long lifetime. Currently, since there are no CRLs for the
proxies, we rely on revoking the end user’s certificate. This is unnecessary. The attacker has not
compromised the user’s private key or passphrase; the attacker only gained access to the proxy. If
we had relatively short proxy lifetimes, we could stop relying on revocation, which is not very fast any
Revisit end-to-end use of proxies and where reducing the allowed proxy life-time will affect things.
This will require changes in the infrastructure because we will need MyProxy inserted into the
infrastructure. Talk to John Hover in the Security Team meeting.
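A server-side lifetime policy of the kind discussed here can be sketched as follows. The function name, the parameters, and the 12-hour limit in the usage note are assumptions for illustration; the validity timestamps would come from the proxy certificate itself.

```python
from datetime import datetime, timedelta, timezone


def proxy_acceptable(not_before, not_after, max_lifetime, now=None):
    """Accept a proxy only if it is currently valid and not too long-lived.

    not_before / not_after are the proxy's validity bounds (timezone-aware);
    max_lifetime is a site-chosen timedelta.
    """
    now = datetime.now(timezone.utc) if now is None else now
    if not (not_before <= now <= not_after):
        return False  # expired or not yet valid
    # Reject on total lifetime, regardless of how much of it remains.
    return (not_after - not_before) <= max_lifetime
```

For example, with `max_lifetime=timedelta(hours=12)` a proxy issued for one year is rejected even on its first day, which is the behavior that would force renewal through a service such as MyProxy.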
[D.O. – I agree that server-side authZ should be able to decide based on proxy lifetime, and perhaps
GUMS is the appropriate place to do it. I think myproxy servers will sprout up in various places. OSG
can run one, DOEGrids does run one but does not get much use and they probably would want a bit
more discussion before it gets heavy use.]
[MA: I am not sure if GUMS should do that. I would rather see this done by the gatekeeper during
authentication. Any proxy with a too-long lifetime *may* not be authenticated by the gatekeeper. Please
notice the may: it is not mandatory. We have to understand how proxy expiration is checked during the
lifetime of a running job. In other words, if the proxy expires in the middle of job running, what
7. Authorization: Banning tool
Why needed: major authorization vulnerability
Without a banning tool, OSG sites have no way of authorizing their users. For a long time we have tried
to use GUMS (a mapping tool) as an authorization tool; however, this has difficulties, explained in my
earlier banning tool proposal. First, GUMS currently synchronizes with VOMS every 6 hours, meaning
that the admin must ban the same user again every 6 hours. The synchronization was a quick hack for
our past problems. We did it because we had trouble distributing VOMS certificates; therefore,
GUMS or the gatekeeper could not verify from the user’s extended certificate whether the
user has VO membership or not. But since we are solving the VOMS certificate distribution problem,
the synchronization will soon go away.
Once we get rid of the synchronization, banning a user will become even harder. GUMS will store
only FQANs, and will not have to list DNs. DNs are already included in extended certificates. Any user
with an authorized VO attribute should be mapped by the GUMS tool. This means that banning a user
based on his DN becomes impossible because GUMS would not have DNs. On the other hand,
GUMS could drop an entire role or VO mapping.
A request has been made to the OSG Privilege Project and to Globus. I will send a more detailed
requirements document to the Privilege Project.
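The core of the requirement is that a ban must be kept separately from the mapping data, so that it survives resynchronization and applies even when only roles are mapped. A minimal sketch under those assumptions follows; the class and method names are hypothetical and do not describe GUMS or SAZ.

```python
class MappingService:
    """Toy mapping service with a ban list kept apart from the mappings."""

    def __init__(self):
        self.mappings = {}   # DN -> local account, rebuilt on every sync
        self.banned = set()  # DNs banned by the site admin, never overwritten

    def synchronize(self, vo_members, account):
        # Models the periodic refresh from the VO membership service:
        # the whole mapping table is replaced.
        self.mappings = {dn: account for dn in vo_members}

    def ban(self, dn):
        self.banned.add(dn)

    def map_user(self, dn):
        # The ban list is consulted first, so a refresh cannot undo a ban.
        if dn in self.banned:
            return None
        return self.mappings.get(dn)
```

Because the ban is checked before the mapping lookup, the admin bans a DN once instead of after every synchronization.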
[D.O. – Good to have the functionality. Unfortunate that GUMS does not work for it because it is
already used as a site-wide PDP and using SAZ just means people need to run another site-wide
[MA: yes, another option is to ask whether we can use GUMS anyway to ban someone. We can explicitly
create a null mapping for a user, but synchronization every 8 hours with VOMS would make this impossibly
hard. The site admin has to change the mapping to null each time. Another problem is that if the individual
user DN is not listed in GUMS (the user role is mapped instead), GUMS cannot really ban.]
8. Authorization: Uniform FQANs across sites, and grids
Why needed: interoperability
We should understand how FQANs are interpreted across mapping tools (e.g. GUMS) and across the
grids. A uniform interpretation would greatly help interoperability.
[D.O. – Yes.]
Longer Term Needs related to both Authentication and Authorization:
9. Authentication: Certificate and Proxy issues
1. Easing X.509 certificate approval process.
Our naive or non-security-expert users have trouble with certificate signing requests,
managing keys, etc. NERSC is working on a seamless system connected to their LDAP to generate
certificates for their users. Fermi can generate Fermi KCA certs. Doug suggested tying the VO
registration and DOE certificate processes together. This would have a high impact on the usability of
the security infrastructure. In addition, Globus’ Dorian project addresses this problem by generating
certificates for users after they authenticate with their home organizations.
2. Distribution and Renewal of host certificates.
This is essential for mutual handshakes. John Hover at BNL is already doing a project for ATLAS
sites. Can we adopt this for OSG?
3. non-X509 forms of identity management systems.
Can we adopt other forms of identity management systems such as Shibboleth, OpenID,
InfoCards and such? See long term goal #2 in authorization.
[D.O. – I would not say X509 is an identity management system, it is an identity token or
credential. Yes, we should push for a federation of identity management systems from which the
grid credentials can be derived. Examples in our OSG/EGEE space already exist for Shibboleth
1. Allowing a user to select which GUMS mapping is returned. Users have no way of
knowing which account they are going to be mapped to. Different jobs may require different
privileges, hence different mappings. GUMS returns the first matched mapping.
[D.O. – The FQAN should define a unique mapping. The user should not need to know about
[MA: yes, but the issue here is that GUMS has configurations set by the site admin. For example, if the
site says return the first mapped account, that could be different from the account desired by the user.
Imagine the user attribute is /CDF/role-developer. If the site selects the no-exact-match configuration,
the user can be mapped into the account that matches /CDF, not /CDF/role-developer. Site configuration
plays a big role in what is mapped.]
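The effect of the site's matching configuration can be shown with a toy rule table. This is not GUMS configuration syntax or code; the function name, the rule format, and the account names are assumptions chosen only to reproduce the /CDF example above.

```python
def map_fqan(fqan, rules, exact_only):
    """Return the account of the first rule that matches the FQAN.

    rules is an ordered list of (pattern, account) pairs. With
    exact_only=False a rule also matches any FQAN underneath its pattern.
    """
    for pattern, account in rules:
        if fqan == pattern:
            return account
        if not exact_only and fqan.startswith(pattern.rstrip("/") + "/"):
            return account
    return None


# Hypothetical site rules, listed in the order the site admin wrote them.
RULES = [("/CDF", "cdfuser"), ("/CDF/role-developer", "cdfdev")]
```

With these rules, `map_fqan("/CDF/role-developer", RULES, exact_only=False)` returns `"cdfuser"` because the broader `/CDF` rule is hit first, while `exact_only=True` returns `"cdfdev"`. The user presents the same attribute in both cases.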
2. Why do we have to be dependent on VOMS attribute proxies? Can we use other attribute
providers? Or other mechanisms? To achieve this, we must first get rid of our VOMS dependency,
which means we must address the AC validation problem. Implementing attributes in a common protocol
such as SAML would help. Today, since GUMS downloads DNs and FQANs directly from the VOMS
database, we cannot use other attribute providers with GUMS.
[D.O. – The important issue is that the token can be validated at the site without run-time
contact to external services (non-local). How does SAML help OSG?]
[MA: we can start using other attribute providers if we switch to the SAML standard, such as
Microsoft InfoCards, the Higgins project or Shibboleth.]
3. Limited proxies. Proxies are used for delegating a user’s rights to a grid site. Doing a full
delegation means that the site that receives the proxy can use it in any way. If there is an
incident at the receiving site (or if the site is malicious), an attacker would gain the same rights. This
causes a great headache during incidents. If we had limited proxies, this would decrease the
number of other sites that may be affected by the incident. For example, if a user delegated her proxy
to be used on site 1, any access request to site 2 by a compromised proxy would be denied. Hence
the spread of the attack would be prevented.
[D.O. – The GT2 gatekeeper already enforces limited proxies as a compile-time option, which
is the way it is distributed by OSG. What other work needs to be done? Just be sure to not lose
[MA: I am not sure how limited proxies work in GT. We must understand this more.]
11. Configuration management tools.
A site admin has no easy interface to manage site security. He has to manually and redundantly
change multiple configuration variables to make any change. This is error-prone and, most of the time,
difficult for the site admins. Important configuration choices are: the list of trusted CAs, the list of
trusted and collaborating VOs (VOMS URLs), GUMS (or other mapping tool) config variables (such as the
template version, updates, etc.), and turning on and off CRL-related software (fetch-crl and
vdt-cert-update). We first need a complete list of all configuration variables and then must gradually
include them in this management tool.
[D.O. – OSG is not the only one with the problem, so let’s not just build a tool on top of our existing
features but look at other trust management efforts.]
[MA: such as? I am curious to hear from you about these tools.]
[MA – Miron wants to see this work elevated in the roadmap and perhaps combine with #3 CA and
12. Validation of security software:
This can be built on top of the tool described above. The configuration choices are validated against the
actual system in order to see if the site software behaves as expected.
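In its simplest form, validation is a comparison between what the site declared and what is observed on the system. The sketch below assumes both are available as dictionaries; the function name and the setting keys in the usage note are hypothetical.

```python
def validate_configuration(declared, observed):
    """Return {setting: (expected, actual)} for every declared setting
    whose observed value differs or is missing (actual is then None)."""
    problems = {}
    for key, expected in declared.items():
        actual = observed.get(key)
        if actual != expected:
            problems[key] = (expected, actual)
    return problems
```

For instance, a declared `{"crl_updates": "on"}` checked against an observed `{"crl_updates": "off"}` yields `{"crl_updates": ("on", "off")}`. How the observed values are collected is the hard part and is left open here.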
[D.O. – I would expect that configuration validation is better done independent of the management
13. More complete error reporting:
Detailing why authorization/authentication have failed.
[D.O. – The fundamental problem, how to give useful error information back to someone who failed
Operational Security: Incident Response and Containment
14. Monitoring the Grid, preventing wide-spread of incidents
Why Needed: Operational Security, attack containment
The biggest operational threat for OSG is the wide spread of an incident. OSG does not assume sites’
security responsibilities, and operates on the assumption that sites can be attacked at any time.
Moreover, we realize that the attack vectors may not be limited to grid-related accounts or
certificates, but may be related to non-grid software and users.
Our goal is to provide help and support to the maximum possible extent so that an incident is
contained and does not spread widely.
To achieve this we should understand a) what factors, other than the obvious grid certificates,
proxies and jobs, can cause a spread, and b) how to monitor these factors. We have a few options
to achieve our goal:
1. We ask all OSG sites to store log files from their grid machines on a separate central
machine. This would greatly help us understand:
a. Whether a user is indeed an attacker or not.
b. An estimated attack timeline.
c. Who else used this machine during that time. Currently we ask site admins
whether they have certs stored on their machines. But as at INFN, if the compromise is root
level, how would they know for sure whether there were more certs or not (INFN logs were
deleted completely)? Storing logs on a separate machine with some encryption would help.
d. Where else the compromised certs may have access. We can check the
compromised user’s access rights in GOC GUMS, use VORS to see which sites
the compromised user’s VO has access to, and ask the user’s VO about the user’s
privileges (this may already be in GOC GUMS via FQANs).
2. Keeping log files at VO job submission portals.
This would be immensely useful, although not all VOs use portals. We can compare the VO logs
and site logs in an automated way to detect unmatched (suspicious) activity and notify the user
and the site. There are downsides to the above solutions: the size of the log files and the sites’ and VOs’
privacy policies. The log files may take up a vast amount of space and their number is large; however,
we can use indexing tools to deal with this. Tools like Splunk can ingest various formats of data and allow
index-based searching. We can index, for example, based on user DN, FQAN and job executable instead of
other log details. This approach would also satisfy the sites’ and VOs’ privacy policies. Currently, sites
provide accounting data to OSG, which can be used for logging purposes. Accounting data includes
user DN and job executable name.
To improve our abilities, we can currently do the following:
i. We can ingest raw accounting data into an indexing tool. I am already experimenting with Splunk
and it seems it can handle Gratia data easily.
ii. Collect accounting data from VOs.
iii. Automatically compare the above 2 data sources for suspicious activity.
The above three steps are well within our capabilities and we have already started some work with
Fermi’s Computer Security Team (CST). CST uses indexing tools to analyze various logs from the
network, etc., and they have a lot of experience with this.
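The automated comparison in step iii can be sketched independently of any indexing tool. The record layout below (dictionaries with `dn` and `executable` keys) is an assumption based on the accounting fields mentioned above, and the function name is hypothetical.

```python
def unmatched_activity(site_records, vo_records):
    """Return site records with no corresponding VO submission record.

    Records are matched on (user DN, job executable); anything a site ran
    that the VO never recorded submitting is flagged for follow-up.
    """
    vo_seen = {(r["dn"], r["executable"]) for r in vo_records}
    return [r for r in site_records if (r["dn"], r["executable"]) not in vo_seen]
```

A real comparison would also need time windows and tolerance for jobs submitted outside VO portals, so a flagged record is a prompt for a human to look, not proof of an attack.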
[D.O. – I am curious to find out how well this works.]
3. Inter-grid collaboration: sharing incident data across the grids.
This is a longer term activity. We must determine the scope of the data we can share across
the grids. We cannot share raw log files, but possibly we can share indexed data for common users
and common sites. We already share data via email lists.
[D.O. – Not entirely clear to me what you are asking for here. It sounds like a policy
[MA – right this is more of a policy issue. We can generate a procedure from this]
4. A mechanism for proxy tracking:
A trail of the sites and machines that touch the user proxy from job submission to the finish.
This would allow sites to develop access policies under which they can deny jobs that have passed
through untrusted domains. For example, a site that refuses to collaborate with another grid site can
do so by examining the proxy trail.
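Assuming such a trail were available as an ordered list of host names (no such mechanism exists today, so the format is purely hypothetical), the site-side policy check itself would be simple:

```python
def trail_acceptable(trail, untrusted_domains):
    """Return False if any host in the proxy trail is in an untrusted domain.

    trail is an ordered list of host names the proxy passed through;
    untrusted_domains is the site's own deny list of DNS domains.
    """
    for host in trail:
        host = host.lower()
        for domain in untrusted_domains:
            domain = domain.lower()
            # Match the domain itself or any host underneath it.
            if host == domain or host.endswith("." + domain):
                return False
    return True
```

The difficult part is producing a trail that cannot be forged or truncated by a compromised intermediate site, which this sketch does not address.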
[D.O. – Sounds like a research project that OSG can tell the funding agencies it sounds like a
good idea to include in some cybersecurity program. Sounds like the OSG work is to write a
15. Fire drills
It is very important that we continue with our fire-drills. I plan to repeat the previous drills: revoking
a certificate and banning a user.
In addition, my future drill scenarios are:
1. How quickly we can drop a site from the OSG infrastructure. I do not exactly know what it
means to “drop a site” but I hope to learn with this exercise. In case of an emergency,
we need to contain an incident at a site. Thus, we need to drop it as soon as possible.
2. How quickly we can drop a VO. This is the same as above.
3. Sites’ forensics skills. We create a pseudo incident on a given site and later observe
the site admin’s ability to pinpoint the incident data. This would show us how much
traceability and logging expertise we have at our sites.
4. VO forensics skills. We create a pseudo incident and ask a VO to back-track the
suspicious user. This will show us how diligently VOs keep track of which privileges
are given to their members.
[D.O. – I think we need to be careful not to run too many fire drills due to risk of alienating sites &
VOs. The cooperation of sites and VOs is the most important aspect of containing an incident.]
Policy Work
We are missing several essential policies necessary for our functioning and auditing.
We are working on local VO policies now: VO Agreed Usage Policy and VO Registration Policy.
These policies are essential for VOs to track their members and their assigned privileges. We expect
to see how VOs store their membership data, how they revoke and grant access to their users,
etc. This information is needed for auditing OSG and our VOs.
[D.O. – I think we need to understand the needs and scope of auditing.]
We are working on OSG VO AUP and registration policy – this is for OSG, OSG EDU, and OSG Engage
VOs. We are also starting to work with ATLAS and CMS.
assessment plan from last year against NIST guidelines. We will work on a new audit policy.
We will complete traceability and pilot policies in conjunction with JSPG.
In the short term, continuing the current focused work as Jim comes on board, I think we can have a
good initial set of policies (if the review can proceed in a timely way) by the end of summer 2008.
Long Term goals are to work on site related policies and VO-site agreements.
[D.O. – Which VOs and Sites are asking for VO-site agreements?]
Below is a summary of the status of policies in OSG:
Policy OSG Procedure Status of OSG specific and Joint Policy
Top Level Grid Security Policy
n/a JSPG-V5.7, not approved by EB
User Registration and VO VO Registration procedure
OSG Security team will start reviewing
Site Registration Procedure Resource registration
Agreement on Incident Incident Response Plan
OSG Doc 19
Grid Acceptable Use Policy n/a
OSG Doc 86
Audit Requirements n/a
Approval of Certificate n/a
OSG Specifics are produced
Guide to Application, n/a
Middleware and Network
VO Security Policy n/a
OSG security team will start reviewing
VO Operations Policy n/a
Site Operations Procedure n/a
Approved by OSG Council: OSG Doc 676
Service Agreement n/a
OSG Doc 87
Policy on Grid Pilot Jobs draft at the JSPG.