Talk:Policy Trusted Virtual Machines
From JSPGwiki
Discussion on draft V1.0 (19 Mar 2010)
Igor Sfiligoi:
- One question that immediately comes to my mind is: Why is the VM being created by a "Site"? It is not site-specific, as you point out. Wouldn't it make more sense to assume a VO will create the VM image?
Dave Kelsey answer:
- The HEPiX Virtualisation Group is assuming that Sites will produce the images (presumably on behalf of a VO).
Igor Sfiligoi:
- Is JSPG thinking only "trusted VMs" will ever be allowed to run on the Grid? I.e., are we moving away from the "run arbitrary code" principle of the Grid? Or is this only the case for "multi-user-VMs", and it will still be acceptable for users to submit their own, fully customized VM images?
Dave Kelsey answer:
- JSPG has no view on this yet. The approach of the HEPiX working group is to define what is needed to make images trustworthy so that Sites will run them. Feedback to date is that many sites will not agree to run any arbitrary VM.
Tony Cass: Producer conditions:
- “Before generating images for use on the Grid the Site must have approval”... How is this approval granted? By whom? And based on what criteria?
- Should there be a list of registered Producers?
- Where is “elsewhere”?
- Do we need a period of validity? Isn’t clause 6 enough? As written, there is no restriction on me putting 9/9/9999 as the expiry date.
- “The Producer” or Site? There must be provision for what happens when the Producer is away or has left the Site.
- As above.
Points 7 to 12: aren’t these all consumer tasks? Certainly much would depend on how the image is used at the Consumer site.
Consumer conditions:
- Why do we want to say this here? Sites are free to use virtual images they have generated locally and no-one would comment on the hypervisor setup in that case.
- OK
Points 3 and 4: I think this matches my comment above that how the image is used is a Consumer responsibility. But, again, why does this need to appear? And is it required? Why can I not use an ATLAS generated image to run some tasks for another VO if I happen to know the image is perfectly appropriate for that? (E.g. I know the image is SL6 + ATLAS software and someone else just needs vanilla SL6. Surely I should be able to use the image and not have to go through the overhead of instantiating a new one?).
Mine Altunay:
- I think this is a fine first draft. Defining the operational procedures accompanying this policy would be more challenging. I am not clear on how the Grid will approve, in a timely and efficient manner, virtual images before production use; I am worried about the effort and expertise needed to perform an audit. If we can define procedures that are clear and easily followed by the image producers, it would ease the burden on Grid security teams.
Jie Tao:
- Shall we have any statement in the policy document about the license issue of the software packages in the images (provided by the producer)?
Oxana Smirnova: I don't quite understand the addressed use case. I see two basic use cases for VM images:
- Virtualisation of Grid services
- Virtualisation of execution environment
In case (1) VM images are most likely to be provided by the Site, although one can imagine some being produced by the middleware providers or even VOs - if it is a VO-specific service.
In case (2), which I think was meant to be addressed by the draft, images in all likelihood will be produced not by Sites, but by the VOs. Therefore I don't understand why the draft imposes requirements on Sites: it should be the other way around, Sites must impose requirements.
Discussion at the HEPiX Virtualisation Working Group meeting on 22 March 2010
Copied from Ian Gable's minutes (slightly modified):
- David commented that the Draft Version 1.0 is very much a zeroth version. He said that it was more important to arrive at an understanding of the goals before trying to arrive at the correct words to accurately describe those goals.
- Several people commented on the need for a role called approver. This is a role distinct from that of producer. However, the producer and approver need not be separate people.
- Dave proposed that the policy document not include technical best practices and that these details should be in another document. The group agreed.
- There was a discussion about whether it would be VOs or sites that approve images, or even whether it was important to make the distinction. I (Ian Gable) struggled to find a conclusion to this discussion.
- Dave also proposed that the policy document address only the producers of images and not the consumers. The group agreed.
- Tony asked that the policy include something to the effect of "The image should not prevent the consumers of the image from fulfilling their obligations under other grid policies".
- Ian Gable commented that one element of the policy should be that the image contain no pre-existing accounts or credentials.
- Dave and Tony both commented that there needs to be a parallel document of technical security best practices to go along with the policy document. Romain Wartel was put forward as a possible good candidate to lead this effort. David Kelsey will contact him to begin arm twisting.
- Dave asked about naming of images and their properties (expiry dates, version numbers etc). Owen had no strong feelings about names but was adamant that images have expiry dates. He advocates user tags that are independent of image names, such that there can be one or two images that match a user tag.
- Tony asked that the consumer section of the document be reintroduced and address what could be contextualized. Tony said by e-mail after the meeting: we did indeed conclude that there should be a consumer section, but referring to constraints on changes to the image as you mention (yum or other updates, areas not to be altered by contextualisation), not items, such as those in the draft, that are really repeats of obligations in other policies.
Other discussion points noted by Dave Kelsey:
- Should be clear that untrusted images can be run by a Site if they decide to. But the scope of this policy is "trusted" images.
- perhaps untrusted images should be firewalled off?
- Both Producer and Approver should sign the image
- there will be VO-specific images and more general images
- Producer should document the recommended configuration
- All updates to images should be made by the Producer. Consumer should not perform any updates (see point 6).
Issues discussed at the JSPG meeting at Nikhef on 25/26 March 2010
This meeting discussed draft Version 1.0 and produced Version 1.1 of the draft policy. The following points are extracted (and somewhat modified) from the JSPG meeting minutes.
- Decided to change the role "Approver" to "Endorser". Endorsement is non-technical, whereas approval implies some auditing and review. The endorser perhaps implies a lower level of accepted liability.
- Lots of discussion about the model: who is generating the images and for what operational environment?
- The use case CERN faces is the request for VO-built VMs that will appear as normal worker nodes within the context of the batch system, that are instantiated as jobs submitted to the standard batch system. It's like having the VO install nodes and connect them directly to your internal network. So, the VMs would become a part of the batch system, and be run inside the trusted environment of the site. Most other use cases call for self-contained VMs (e.g. the BiG Grid use case), be they generated by the VO or the user.
- Normally, sites will contain the image and it will NOT have any privileged access to the site. Certainly not get access to trusted ports or share NFS or be part of the 'regular' batch system!
- The model we are addressing in the policy needs to be clear first!
- Model 1 is the Computer Centre view. Increase the number of worker nodes by virtualising them. Fully controlled by the Site with full access to the batch system and network file system. Neither the VO nor the User has root access.
- Model 2 is the VO view, but with images produced by a small number of trusted people on behalf of the VO. Similar to some aspects of the Amazon EC2 services and/or the CERNVM project. User probably needs root access to the VM instance to monitor and maintain their environment. This may be OK if the VM does not have access to the site batch system or site file system.
- Model 3 is where individual users are producing their own images. Difficult to see how this could be done in a trustworthy way except for full containment of the running image.
- For model 1, where the user DOES NOT have root, this is no different from what we have today.
- But for model 2, it should first be clear exactly what it is.
- For now: concentrate on model 1 first, and concentrate on the creation of the images.
- Concentrate on the producer and the endorser, and we ignore the 'running' part of the life cycle. At the site: deployment and configuration needs to be added before running the machine. There is a maintenance issue there.
- Possible roles (people may then get more than one role): Who has root access? Maintainer? Operator? VM Manager? The one who looks after running the VM at the site and takes care of traceability.
- Producer: is this an individual or a role? May want the producer to sign an agreement? Must be supervised by a person, even if it is an automated process creating the images. And (s)he should be authoritative?
- For the point of building trust, we need a name? Or can it be a group of people?
- The trust will be on the endorser, which even also may be a group of people.
- Policy requirements come on the endorser, and the producer (say, RedHat) may not even be aware of the policy. The endorser takes up the responsibility.
- There are then requirements on the endorser, like
- Response time
- Ability to actually subsume liability (i.e. a natural person or a chartered organisation, etc)
- That is not in the definition of the endorser, but should be in the requirements later down in the policy
- In order to prove endorsement, there should be a signature which will have to be a single signature. If the endorser has to sign, it would become a person who is technically capable, which may be a different role. Whereas you also want somebody high enough to be actually responsible and take that responsibility. A higher-level endorser probably cannot be bothered to re-sign every time a new VM is produced. The signature may be paper based and an out-of-band process. And he should sign the actual signing key. And key is then used by the Producer??
- The endorser is responsible for the fact that the image is produced in a way compliant with the policy, but liability remains a problem. In the end it comes down to trust, since sanctions will be undefined in this context.
- Who has access to root accounts? e.g. having root in a VM could mean full control of the batch system.
- If access to resources is needed, there may need to be pre-installed credentials of some form, but in general NO credentials should be pre-installed.
- The level of containment depends on the hosting environment, and these have different security implications (e.g. for network access on ports <1024).
- The difference between policy and reality is growing, and the extent of the gap is unclear.
- Our computer centres are far more integrated than the generic hosting environments that Amazon et al. have set up.
- Aim is to create user environments so, by definition, the target site cannot create them.
- Why is the VM created by a site? Comes down to the 'approver' of the image. Anyone who approves?
- Who is in control, i.e. who will ensure that requirements are met?
- User community is not necessarily affiliated with a Site, e.g. Atlas is not part of CERN/IT.
- Currently, virtual machine environments are not necessarily self-contained. Some hypervisors run with root and grant network access.
- Also, user jobs now have access to shared file systems, i.e. have access to trusted parts of a site's network. This is not compatible with the current site setups.
- Jules: the review mechanism is fragile, so better containment is a better way, making the VM like a current user job.
- Users will want configurable machines, that also incorporate parts that contain their own code and click their VM together. Do these need to be reviewed? Does that scale?
- Users should not be able to add stuff to the VM as root.
- Aim is to make images that are sufficiently trustworthy to convince sites to run them. (above running arbitrary images).
- Tying a producer to an institute (not a 'grid Site') is to have a binding with some entity that is persistent.
- Images either need to be built almost daily, as they need patches, or they need to be patched when started.
- What to do with images that contain a historic operating system without patches? They should not be accepted by sites.
- There should be a mechanism to 'revoke' an image, as well as a TTL.
- Look also at P2P systems and their use of an index with a hash.
- The sites need to check the running images and check if the image appeared on the black list.
- Traceability is unexplored territory. Which machine was behind an address? What machine was run when and where? In case of a hypervisor exploit, you also may need to rebuild the machine the VM ran on!
- Who is starting the image, and how, is not yet discussed in the HEPiX working group.
- In the MUPJ policy, there are several parties that all must agree: the Grid as well as sites. But since in this policy we only look at the producers, and not at the people who run them, there are fewer agreements to be made.
- There is however trust needed on the Endorser. This relieves the RPs from relying on the Producer, since the Endorser has to have done the checking. And you trust all endorsers?
- Comparable to the choice a site makes when trusting a distribution, like RedHat, to produce a reasonable operating system. There you trust RedHat to endorse this image.
- There is a mechanism needed to 'remove' endorsers that have proven to be untrustworthy, and have them as well as all their endorsed images out of a repository.
- A virtual machine image MUST be bound to an identifiable endorser. This may need an additional signature from the endorser on the image.
- The endorser should disclose the policies and practices according to which images are verified and endorsed.
- And the policy should set minimum requirements on these policies and procedures. The policy components go in this current document. c.f. RedHat will have a policy that governs access to the enterprise package signing keys.
- VM Image must be:
- Identifiable (e.g. by a digest)
- Lifetime limited (or not?)
- Revocable at any time
- Revocation data must have a life time in itself
- Revocation information must not be replayable (i.e. have a serial or an issue date/time)
- The model on patching is that the producer would do the patching, not the receiving site. That triggers the life time requirements.
- Essentially, you end up with a "CRL" with not only negative but also positive entries. Or only positive entries?
- The endorser must maintain an up-to-date signed list of currently valid images, which itself meets the requirements on lifetime: the list must have an expiration date and an issue date/serial number, and must be signed.
- The set of endorsers is the 'root of trust' and the endorser needs bookkeeping of all images in the image list, so that only recent and endorsed images are in the list.
- But then the image itself does not need an expiration date. The digest of the VM is in the signed list.
- Patching of images is now impossible, since that will modify the signature and invalidate the image.
- This now defines the change between the various classes
- For patches at run time, you essentially cannot make a consistent snapshot. Trivial solution is to cancel and restart the jobs. Modified images are hard to re-endorse.
- Since the running site should not change it, it becomes a site issue. The proper solution would be to just kill the job, but then some sites in practice don't kill jobs when they would need to.
- By modifying the image, it becomes a site image and would not need an endorsement any more.
- Once the image is downloaded by a site, different (site) policies apply. Not necessarily this Policy. Endorsements are relevant only to new virtual machines. We are in "case 1" anyway, where the site has root access.
- The site needs to keep track of which images are running. So they can check periodically with the signed positive list, and if the original image is removed from the list something has happened and the site can decide what it does: either just shut the instances down, or look into it, patch them and keep them running. That is an option that is outside the scope of this policy.
- The Consumer cannot hold the endorser liable for images that are no longer on the endorsed list. This addresses what happens to running images.
- Endorsers must have a vulnerability assessment capability.
- The Endorser must ensure that the Producer builds the image in a way that enables enough traceability for incident response.
- Should the endorser make the configuration (package lists, versions, last update date) available? Yes, but to whom?
- They must have the list anyway, since the endorser must have a vulnerability assessment process that needs this information. And the machine itself most likely has the data inside. If they are not maintaining this, the endorser is not trustworthy. A package list with versions would be a good implementation of this.
- "Must request and respect", i.e. to keep out users that the site does not want to have. This is for the 'model 1' where the VM would run inside the trusted fabric as part of the batch system.
- And accounting in this way? Accounting the complete VM seems the logical way forward. That's the main issue with model 1 in the first place. Actually, no site should want to run in this model anyway, since it's not doable.
- Software licensing issues are also open. Endorser is fully liable for license compliance for anything in the image, e.g. if you include a piece of GPL code, you will be required to make the full source available -- since you redistribute the GPL software. This issue partly goes away if you run the image itself, instead of requiring the site to run it. So this is another problem with model 1.
- Read access control to repository is also needed, especially for licensing and private software.
- There are also security considerations in making the images world-readable. But distributing software in a confidential manner would be really difficult.
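The signed-list model discussed above (images identified by a digest, a list with an issue serial and an expiry date, and sites periodically checking running instances against it) could be sketched roughly as follows. This is only an illustration under stated assumptions: the EndorsedImageList class, its field names and the helper functions are hypothetical and not part of any draft, SHA-256 is an assumed choice of digest, and the list's signature is assumed to have been verified before these checks run.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EndorsedImageList:
    """Hypothetical in-memory form of an endorser's list (signature already verified)."""
    endorser: str
    serial: int          # increases with every issue, so an old list cannot be replayed
    issued: datetime
    expires: datetime    # the list itself has a limited lifetime
    digests: frozenset   # hex digests of the currently endorsed images


def image_digest(image_bytes: bytes) -> str:
    """Identify an image by the SHA-256 digest of its contents (assumed algorithm)."""
    return hashlib.sha256(image_bytes).hexdigest()


def list_is_acceptable(lst: EndorsedImageList, last_seen_serial: int, now: datetime) -> bool:
    """Reject lists outside their validity window or older than one already seen."""
    if not (lst.issued <= now < lst.expires):
        return False
    return lst.serial >= last_seen_serial


def is_endorsed(image_bytes: bytes, lst: EndorsedImageList,
                last_seen_serial: int, now: datetime) -> bool:
    """An image counts as endorsed only if a fresh, non-replayed list contains its digest."""
    return (list_is_acceptable(lst, last_seen_serial, now)
            and image_digest(image_bytes) in lst.digests)


def instances_to_review(running: dict, lst: EndorsedImageList) -> list:
    """Names of running instances whose original image digest is no longer on the list."""
    return sorted(name for name, digest in running.items()
                  if digest not in lst.digests)
```

In this sketch, removing a digest from the list acts as revocation, which is why the image itself carries no expiry date. What a site does with the instances returned by instances_to_review (shut down, or patch and keep running) is left to the site, as noted in the discussion.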
Discussion at the HEPiX Virtualisation Working Group meeting on 29 March 2010
Extracted from the minutes:
- There are two models for distribution
- signed images, with a signed list of endorsed images
- certificates assigned to individual images
- Tony asked that the issue of having individual certificates for images be resolved by the security group, rather than the distribution group. This would avoid discussing it in two different places. This was agreed.
- Tony commented that revocation by endorsers is preferable to relying on expiry dates because this will require active involvement from the endorsers. If endorsers fail to revoke their images in a timely fashion we will know not to trust them.
- Ewan commented that having expiry was a failsafe mechanism for cases such as people forgetting about an image or leaving their job. John commented that this was the point he was trying to make.
- There was some discussion about whether running VM instances should be patched. Not sure if there was a conclusion here; the group needs to decide if sites can patch images or should simply refuse to instantiate an image if they have concerns. Romain said that some fixes might not require a reboot. Tony commented that in practice this does not happen frequently, i.e. nodes are normally taken offline and rebooted for security patches. Romain agreed.
- Tony commented that his preference was that we create something that is usable soon, and may not be as automated as we would like. We can work on improving it after we have something that is usable.
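One way to read the revocation versus expiry points above is that a site checks revocation first and treats the expiry date purely as a failsafe. The sketch below illustrates that reading only; ImageRecord, Decision and instantiation_decision are hypothetical names, and the group did not agree on any such interface.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Decision(Enum):
    RUN = "run"
    REFUSE_REVOKED = "refuse: revoked by endorser"
    REFUSE_EXPIRED = "refuse: past failsafe expiry"


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical per-image metadata held by a site."""
    digest: str
    expires: datetime  # failsafe in case the endorser forgets the image or leaves


def instantiation_decision(image: ImageRecord, revoked_digests: set,
                           now: datetime) -> Decision:
    """Revocation is the primary control; expiry only catches neglected images."""
    if image.digest in revoked_digests:
        return Decision.REFUSE_REVOKED
    if now >= image.expires:
        return Decision.REFUSE_EXPIRED
    return Decision.RUN
```

Reporting the two refusal reasons separately lets a site notice images that only ever lapse through expiry, which in the spirit of Tony's comment may indicate an endorser who is not actively revoking.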
Other issues discussed on the HEPiX WG mail list 27 March to 8 April 2010
- Thomas 27 March - comments on the JSPG draft policy version 1
- Sander 31 March
- BigGrid paper
- Can we rename "consumer"?
- What are the trust models for Amazon EC2 and BOINC?
- Sander 3 April - Separate out root and user issues
