Talk:Grid Policy on the Handling of User-Level Job Accounting Data
From JSPGwiki
Jan 23 2009 JSPG Meeting Notes
This policy is likely EGEE/WLCG focused. OSG recently approved its own Privacy Policy (http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=741). There is a lot of activity in this area in educational federations in Europe.
Informed consent isn't enough. We're in the "contractually required" situation, i.e., not an opt-out situation. So, the policy no longer discusses user consent but rather how users are informed.
Private Data: Anything that identifies directly or indirectly an individual. Dave Kelsey showed slides on this topic at the last meeting (http://indico.cern.ch/conferenceDisplay.py?confId=38223).
About half of EGEE sites are already releasing user-level accounting information, and we hope this policy captures what they are doing.
Inside one site, we assume site local procedures and policies are obeyed. The issue is when jobs and accounting data cross sites and countries.
VO management needs to know who within their VO is using resources.
A general privacy policy for EGEE/WLCG is too difficult. We tried that. We focus first on user-level job accounting data.
WLCG MOUs indicate that user-level accounting data will be shared.
Who is the target audience for this policy?
- Site Managers. Added a section on Scope.
- Do we describe what/how/where or do we specify requirements (MUST, MAY, ...)?
- It seems we're still confused about this.
Next we'll ask for comments on the JSPG mailing list.
--JimBasney 23 Jan 2009
Feedback received, for discussion at May 2009 JSPG meeting
Steven Newhouse (CERN): Section 10 mentions the non-transfer of data outside of the EU. How is the EU defined in terms of EGI? EU27 or something else? Has the role of EGI in this document come up in discussion?
[JSPG response: The important thing is meeting the requirements of EU Directive 95/46/EC. Section 10 has now been modified to make this clear. It is EU27, plus EEA plus some other countries. The policy is now very much an EU policy rather than Global in scope, but this should serve as a good starting point for EGI.]
Ursula Epting (Karlsruhe): Section 4 Accounting data storage says "Each site collects and stores an accounting record for each job executed at their site. These records are stored locally at the site." Possibly a reference could be added that the records are/should be protected according to national data privacy laws (if existing)?
[JSPG response: Agreed. Section 4 modified.]
GDB discussion 11 March 2009: Many countries/funding agencies also want to view use of their local resources by other nationalities and what resources have been used on foreign resources by their own nationals. It was agreed by GDB that this is for a later "Phase 2" and does not need to be covered in the current policy.
Ruth Pordes (OSG) GDB 11 March: User accounting data (in OSG) is currently public. Acceptance of the policy will result in a plan to move the data from Gratia to APEL on some timeline. This development will take some time (competing with other needs).
Cécile Barbier (IN2P3): Members of the French Accounting Working Group attended your talk at the last GDB meeting at CERN (March 11) and read the new versions of all JSPG documents, particularly the document that you wrote: Grid Policy on the Handling of User-Level Job Accounting Data.
According to paragraph 'Control of and access rights to the information' of this document, we can identify 3 levels of aggregated data in the GOC:
- VO level
- VO group/role level
- user level
From paragraph 'The purpose and reasons for the collection and storage of the information', we can understand the need for 'anonymised and aggregated accounting' which may justify long term retention. On the other hand, in paragraph 'The period of retention', it is written that the GOC is responsible for preserving the aggregated data for as long as this is required by the VO or grid management.
This brings up several questions:
- How long can this period of retention be?
- Are there upper and lower limits?
[David Kelsey answer: I think we imagined that the project would want to keep (some of the) aggregated data essentially for ever. The idea was that this would be separately agreed by the Grid and VO and not set in the policy. Do you think we need to define this more carefully?]
[Reply from France: Maybe the GOC could set a maximum default retention period for user level aggregated data (if it hasn't been set yet). Then, if the Grid management and the VOs would like to set another value for this limit, they could specify it in their policies and maybe write that it is done for statistical and scientific purposes. For anonymous aggregated data, we don't think that setting a maximum retention period is very important, precisely because data are anonymous.]
- Is this period the same for all 3 levels of aggregated data?
- Can user level data be stored shorter and what is the upper limit?
[David Kelsey answer: I can see big advantages in setting a limit for user-level aggregated data in the policy. What time would you think is a good limit? 12 months? I think I need to look more closely at exactly what types of aggregated data are actually (today) being stored.]
[Reply from France: We have just checked our local policy at CNRS (French National Scientific Research Center) and it appears that personal data can be stored 12 months at most. So 12 months would be ideal to us as it is the same value as in our local policy. If this time limit should be more than 12 months, we would probably need to ask the CNIL (French Data Protection Authority) for advice about it.]
In other words, is it possible to ensure that only "anonymised data" will be kept as long as required by the VO or grid management? Is it possible to assume a "right to oblivion"?
Our concern is that we need to comply with our local data retention policy and currently, we cannot guarantee that the GOC is fully compliant with this policy regarding the aggregated data computed from the records that our sites provided.
[JSPG response. Agreed. Section 7 on retention has been modified to make clear that the user-level personal data needs to be either deleted or anonymised after 12 months.]
--David kelsey 12:10, 11 May 2009 (BST)
May 14-15 2009 JSPG Meeting Notes
Reviewed feedback received (see above).
Should this be an EGEE-only policy, given that USA privacy policies are so different?
- It's not clear that OSG would adopt it.
Updated accounting data storage section as suggested above by Ursula and Cécile.
Are references to GOC too EGEE-specific?
- Should the policy say "The Grid securely stores..." rather than "The GOC securely stores..."?
- Editing accounting data storage section to be less EU-specific?
- Rather than GOC, should we use Accounting Data Center (ADC)? It describes anywhere that holds accounting data. Agreed.
Edited period of retention section.
Are people entitled to look at the data in the EU from the US?
- The VO resource manager must be Safe Harbor members?
Is your certificate DN personal data?
- Yes, because it contains your name.
This policy has been reviewed by GDB, but it has not yet gone for wider consultation. It should go to EGEE and others.
To what extent do VOs have a say in what happens?
- Are we going to do user-level accounting for all VOs?
- A site is allowed to send accounting data, but VOs must sign up?
- In EGEE, can sites configure accounting per-VO? Probably not.
- "The VO SHOULD have a choice."
- Storing the data locally is not an issue for this policy. That occurs in any case. This policy is about data being centrally collected by the Grid.
When we refer to EU, it is "the European Economic Area and countries that comply with 95/46/EC, the 3 EEA member countries, and those countries that have adequate protection of personal data on the basis of a commission ...."
- Should this text go in the policy? Or can we reference 95/46/EC?
- Is there a single name for this group? No.
- Is this text too confusing? It can be helpful to point readers to the specific 95/46/EC directive.
Is the tie to EU or 95/46/EC a MUST? This ties the policy to the EU and makes it inapplicable to JSPG members outside the EU.
- Should we put EU in the title?
- Better to change the scope. Done.
Can sites retrieve accounting data from the accounting data center? Can a grid site outside the EU request accounting data from other sites that are inside the EU? That would be a transfer of data outside the EU. Or what about an individual traveling outside the EU?
- Access to aggregate VO data is public. Only VO members can see more specific job data. Only VO resource managers can see the person's name.
- We don't say anything about where the VO resource managers are located? They must sign this document.
- The EGEE accounting portal (in Spain) allows data access from anywhere?
This all depends on the AUP clause about data privacy.
Is this policy helpful if it's not vetted by lawyers?
- WLCG MB is putting the responsibility on each of the sites. They are requiring the data to be published, and the sites must decide if they can do it.
- It's not clear that we (JSPG) can make a meaningful statement on this topic, given the confusing state of laws in different countries.
By contrast, Federation-based systems are avoiding the release of private information.
User-level job accounting is motivated by the VO manager's desire to know who used the VO's resources. (See the section on "reasons for the collection".)
How do I know who has signed this agreement? What is the practical implication of the final statement in the policy?
- Policy modified to allow disclosure to anyone with legitimate access.
Is this ready for wider consultation? Agreed.
--Jim Basney 14-15 May 2009
Changes between versions 0.7 and 0.8
- Section 2 Scope now explains that this is aimed at EU-based Grids.
- Section 4 Accounting data storage modified to say local records must satisfy local data privacy laws.
- Section 4 now defines an Accounting Data Centre (ADC) rather than using GOC for this location.
- Section 7 Period of retention now makes it clear that user-level personal data needs to be either deleted or anonymised after a period of 12 months.
- Section 10 International borders has been strengthened to describe where the ADC needs to be located in terms of meeting the requirements of EU Directive 95/46/EC.
- Section 12 Signatures no longer requires individuals to know whether other individuals have signed the policy.
--David kelsey 18:35, 20 May 2009 (BST)
Feedback received and JSPG responses on version V0.8, 15 May 2009
Stefano Belforte (INFN): Should this have a definition of what Accounting Data are? From "Accounting data storage" section it reads that accounting data are generated by the site. This seems to exclude all data recorded for monitoring purpose by e.g. pilot job frameworks. Is that so?
[JSPG response: We have added a sentence in the "Scope" section to make it clear that this policy refers only to job accounting data collected by the Grid and explicitly stating that it does not apply to other forms of accounting or monitoring. We should stress that any individual or body who processes accounting or monitoring data containing personal identifying information is subject to data protection laws. It is their responsibility to consider what procedures and documents are needed.]
John Gordon (RAL): Could I put in a word for retention for 13 months? I have previously found logs that time out exactly on sensible boundaries frustrating: for example, when you try to run a report on a previous month on the 1st of the next month, you find that you have already lost records from the 1st of the previous month. I don't expect we will want to run reports on individuals for a year but people may want to look at their own use in the previous year. Extending it a little can be useful. It gives you January to analyse the previous year.
[JSPG response: We considered this, but feedback suggested that any period of retention over a year may require (some) sites to consult local data protection officers. We decided to stay with one year as the limit (expressed as one year rather than 12 months).]
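The boundary effect John describes can be illustrated with a short sketch. It is not taken from any accounting system: the function names `retention_cutoff` and `covers_previous_year` are hypothetical, and it assumes that records received before the cutoff date have already been deleted or anonymised.

```python
import calendar
from datetime import date


def retention_cutoff(today, months):
    """Earliest receipt date still retained on `today` (assumed semantics)."""
    # Step back `months` calendar months from `today`.
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that e.g. 31 March minus one month is a valid date.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def covers_previous_year(today, months):
    """True if every day of the previous calendar year is still retained."""
    return retention_cutoff(today, months) <= date(today.year - 1, 1, 1)


if __name__ == "__main__":
    report_day = date(2010, 1, 15)  # hypothetical day on which a report is run
    for months in (12, 13):
        print(months, retention_cutoff(report_day, months),
              covers_previous_year(report_day, months))
```

Under these assumptions a 12-month limit already loses the first days of the previous January when the report is run in mid-January, while a 13-month limit keeps the whole previous year available until 1 February.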
Tiziana Ferrari (CNAF, on behalf of IT ROC):
- "Each site is responsible for sending its accounting records on a regular basis, e.g. daily, with at least user DNs encrypted in transport, to a central data base defined by the Grid. This database is located at an Accounting Data Centre (ADC). The location of the ADC needs to be chosen carefully according to data privacy laws." It would be useful if the document could clarify the concept of "central DB", as in the EGI era every NGI is requested to have a national collection point, and actually some proto-NGIs already have one. In addition, "the Grid" is a generic term that should be clarified: what is the legal entity responsible for defining the "central data base"? Shouldn't we be more flexible and allow the transmission of aggregated usage records across national boundaries? This is also implicitly assumed in the development plans of the APEL development team.
- "Each site is responsible for sending its accounting records on a regular basis, e.g. daily, with at least user DNs encrypted in transport, to a central data base defined by the Grid." We should leave the freedom to let the VO choose if its usage records need/can be published in the ADC. If necessary, the VO may choose not to publish centrally if required for privacy reasons, or may require to publish in an ad-hoc ADC. Consequently, the site should publish records according to the requirements of the VO, and not by default all the usage records from all VOs.
- "Access to aggregated data at the VO level is public information and requires no access control." Not necessarily. The usage records of a VO may not be published at all if requested by the VO.
- "The ADC is responsible for deleting the copies of the individual accounting records in the central database, or for removing or anonymising personal identifying information, e.g. subject DNs, from these records, at the latest one year after receipt of the data in the ADC." Why should we request that this information is removed "at the latest" after one year? Why one year? Shouldn't we request that those data are available for a given agreed minimum amount of time?
- "The ADC publishes user-level accounting data to the authorised VO Resource Managers." This statement has an impact on the implementation of the middleware. What minimum level of aggregation is requested for the user-level usage records?
- "The individual job accounting records are transferred between the sites and the central database at the ADC. Many of these transfers cross international borders." We should clarify that those individual accounting records "are" transferred across NGI boundaries only if the VO requests this.
- "The user name is encrypted before it is sent across the network." Is this requirement truly needed to meet any law? The DN does not include any private element of information, does it?
[JSPG response. Comment 1: We have added text to the section on "Accounting data storage" to make it clear that there may be multiple ADCs within a Grid, and that exchange of accounting data between multiple ADCs is covered. This policy document is not able to specify all possible future accounting models, but we have tried to make the document general enough to allow a good range of possible future requirements. As John Gordon explained - see below - we use the term "Grid" to make it general and useful to many Grids. Comment 2: We leave the Grid and VOs to agree what data needs to be published and what access control is required. This policy document is aimed at allowing it to happen, should all decide they want to publish. Comment 3: Agreed. Wording has been changed. Comment 4: Personal identifying information, such as a name, is subject to data protection laws. Advice from sites is that they will only transfer such information to an ADC if we have this policy document specifying certain aspects of the handling of this data. One critical component is the period of retention of personal information and feedback from at least one country is that periods longer than one year cause many more problems, e.g. requesting permission from local data protection officers. JSPG therefore decided to specify a maximum retention period of one year. Comment 5: We have given some examples of types of aggregations in the section on Accounting Data Storage, saying that these levels are defined by the Grid. Comment 6: A sentence has been added to the Accounting Data Storage section. Comment 7: As John Gordon points out below, the user's name is personal information and is covered by data protection laws.]
John Gordon (RAL) (in answer to Tiziana's comments): Comment 1: An attempt has been made to make the policies generic, i.e. not to name specific grids, so that the policies can be adopted in multiple grids and at multiple levels. The same policies could be adopted at EGI and NGI level. The particular grid could have a top level policy document which defines the term 'The Grid' in all the attached policies. Does the policy forbid the transition of aggregated records across national borders? Comment 3: Accounting is for both VO and management use. I do not think a VO can veto the collection of usage data by the underlying infrastructure. If a VO is not 'international' then it may not want its data sent to a repository outside the country, but within the NGI it should not be able to veto what information the NGI requires. Comment 4: For legal reasons, some countries are more relaxed about data that is kept for a defined period. Comment 5: EGEE currently aggregates to Site/VO/Month/User so this is the maximum. I don't think a minimum has been defined. Comment 7: Since our DNs mainly include real names, they become 'personal data' in that they tell you what some real human being is doing.
[JSPG response: Many thanks to John for these useful replies. We agree!]
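As an illustration of the aggregation levels mentioned in this thread (Site/VO/Month/User, and coarser levels such as per-VO), here is a minimal sketch of a group-by over job records. The record fields (`site`, `vo`, `month`, `user`, `cpu_hours`) and the function `aggregate` are assumptions made for the example and do not reflect the actual APEL schema.

```python
from collections import defaultdict


def aggregate(records, keys):
    """Sum job counts and CPU hours per combination of the given keys."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        bucket = totals[tuple(rec[k] for k in keys)]
        bucket["jobs"] += 1
        bucket["cpu_hours"] += rec["cpu_hours"]
    return dict(totals)


# Hypothetical job records; all values are invented for illustration.
RECORDS = [
    {"site": "SITE-A", "vo": "vo1", "month": "2009-05", "user": "u1", "cpu_hours": 1.5},
    {"site": "SITE-A", "vo": "vo1", "month": "2009-05", "user": "u1", "cpu_hours": 2.5},
    {"site": "SITE-A", "vo": "vo1", "month": "2009-05", "user": "u2", "cpu_hours": 4.0},
    {"site": "SITE-B", "vo": "vo2", "month": "2009-05", "user": "u3", "cpu_hours": 1.0},
]

if __name__ == "__main__":
    # User-level aggregation: contains personal data (the user identity).
    print(aggregate(RECORDS, ("site", "vo", "month", "user")))
    # VO-level aggregation: no user identity appears in the result.
    print(aggregate(RECORDS, ("vo", "month")))
```

The only difference between the levels is the set of keys; dropping `user` from the key yields summaries that no longer carry the user identity, although, as Stephen Burke notes further down, a very small group can still identify an individual.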
Tiziana Ferrari (CNAF, continuing the discussion with John Gordon): Comment 1: OK, I would suggest then to clarify this in Section 4, where a definition of Accounting Data Centre ADC is provided. In particular this section says "The ADC securely stores all the individual job records from each of the sites submitting such records." I'm not sure this is in general applicable at the EGI level. In addition, this list does not explicitly mention that in case of a multi-level infrastructure, the ADC may need to send usage records itself to another ADC at an upper layer. "Does the policy forbid the transition of aggregated records across national borders?" No, nothing is said about this, this is the point. The document does not say what type of aggregation, if any, the ADC needs to apply. I would suggest to clarify. Comment 3: May not veto the collection of records, but may require that its records, even if aggregated, are not "public information" for all. In Italy we have use cases of this kind, and this is an important requirement for applications from the industry. Even if national in scope, it may still require that its accounting records are only accessible to a restricted list of people, such as few internal VO members and site managers. Comment 4: Fine with this, but I would emphasize that the retention policy has to be negotiated. Why one year? It could be more or less if this is accepted by all relevant parties.
[JSPG response. Comment 1: Wording has been added to clarify the situation with multiple ADCs. Transmission of aggregated data between ADCs is now foreseen, as long as both agree to be covered by this policy. Comment 3: As noted in our reply to you above, this policy document is aimed at being general. Grids and VOs are free to decide what they need to do. Comment 4: See our answer above. Retaining for less than one year is OK, as far as data protection legal issues are concerned. JSPG advises against retaining longer than one year as then some sites may be unwilling to publish accounting data.]
Cal Loomis (IN2P3): I think that it will be necessary to keep some details of the "type" of the user after the DN has been deleted. One that immediately comes to mind is keeping permanently the "country" (or probably more reasonably the CA) of the user. This allows some reasonable look at cross-country sharing of resources. I can imagine that this information will be needed "forever" to allow historical trends to be taken into account. Perhaps there are other details of the DN that should similarly be saved. Of course, we can't require too many separate fields otherwise the identity of the user could be deduced from the saved data.
[JSPG response: JSPG agrees that this could be a useful requirement, but this policy document cannot require it. We have changed the wording in the section on Period of Retention to make it clear that only personally identifying information needs to be removed or anonymised after one year, e.g. CommonName or e-mail address. Accounting systems may if they wish keep other attributes for longer.]
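A minimal sketch of what this could look like, assuming DNs are held in the slash-separated form (e.g. `/C=.../O=.../CN=...`). The function names, the example DN and the choice of which attributes count as personally identifying are assumptions for illustration; the policy only names CommonName and e-mail address as examples.

```python
# Attribute types treated as personally identifying (an assumed list).
PERSONAL_ATTRS = {"CN", "EMAILADDRESS", "EMAIL", "E"}


def parse_dn(dn):
    """Split a slash-separated DN into a list of [type, value] pairs."""
    attrs = []
    for piece in dn.split("/"):
        if not piece:
            continue  # skip the empty piece before the leading slash
        key, sep, value = piece.partition("=")
        if sep:
            attrs.append([key.strip(), value])
        elif attrs:
            # A piece without "=" is taken as a "/" inside the previous value.
            attrs[-1][1] += "/" + piece
    return attrs


def anonymise_dn(dn):
    """Drop personally identifying attributes, keep the rest (e.g. C, O)."""
    kept = [f"{key}={value}" for key, value in parse_dn(dn)
            if key.upper() not in PERSONAL_ATTRS]
    return "/" + "/".join(kept)


if __name__ == "__main__":
    # Hypothetical DN, not a real user.
    print(anonymise_dn("/C=UK/O=eScience/OU=Example/CN=jane doe"))
```

Whether the attributes that remain (organisation, organisational unit, locality) are coarse enough not to identify a person depends on the Grid; as Cal notes, keeping too many separate fields could allow the identity of the user to be deduced.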
Stephen Burke (RAL):
- In practice a VOMS group or role may be enough to identify an individual uniquely, is that relevant? (At the extreme you could even have a VO with only one member.)
- What about user-level accounting information held by the VOs directly, e.g. in pilot job systems?
- And indeed what about the job records stored in the LB servers, and anything which reads from them (RTM?)?
[JSPG response. Comment 1: This could happen in theory. Any data which is able to identify a person is indeed covered by the data protection laws. Grids and VOs should consider this issue when they design an accounting system. Comment 2: This policy document does not address this issue - see answer above to Stefano Belforte. Comment 3: same is true for these other types of monitoring.]
--David kelsey 18:44, 18 Jun 2009 (BST)
