# ESSI

This web page is for information related to the 2010 ESSI meeting held on August 2, 3, and 4. The main ESSI page is http://essi.gsfc.nasa.gov/. Last year's presentations are here: http://essi.gsfc.nasa.gov/umbc_workshop/presentations.html

AGU's ESSI Focus Group serves to facilitate communications and coordinate activities related to issues of data management and analysis, large-scale computational experimentation and modeling, and the hardware and software infrastructure needs to span the range of scientific topics of interest to the Union.

Following this meeting, the Heliophysics Data Model Consortium will hold a meeting in the same room, from Wednesday afternoon through Friday morning. See HDMC for details.

# 1. Workshop Summary

• Software
• We need more than data: data must be translated into knowledge, and this can be accomplished through Virtual Organizations
• Virtual Organizations combine data access with social collaboration tools such as Drupal, wikis, blogs, and forums
• However, bringing domain scientists to the social network is the most challenging part of the problem
• Software needs to be open from the start with an emphasis on integration
• The use of a few fundamental data standards promotes software reuse and system interoperability. For example, a common data dictionary standard lets potential collaborators share a controlled vocabulary and allows a shared data dictionary software library to be developed.
• There are various levels of service to information systems
• What you get from an investigator (data/metadata) dictates what your system can do
• There is a long tail distribution for levels of service - a few high volume data sets are well described and thus allow a number of services to be built on top of them. However, there are a lot of small non-continuous data sets which are not well documented.
• How the PI thinks about data may not be how the community wants to use it
• Informaticist, computer scientist, software engineer, and programmer are not the same thing, yet they tend to be lumped together. All are vital to the success of projects.
• The people who need to hear informatics talks are not coming.
• Reuse Readiness Levels (RRL) are a means of quantifying various facets of software reuse. They allow developers to describe numerous attributes of software and allow potential users to evaluate the degree to which the software could be reused.
• This builds on the notion of technology readiness levels, which are common in hardware reuse. Prior to the RRL effort, efforts to "improve" software were difficult to quantify. RRL web site: http://esdswg.org/softwarereuse
• Cloud computing and IaaS are beginning to be adopted by Earth and Space scientists. This paradigm was described with an overview of the space weather cloud applications of the University of Alberta and UCLA.
• Visualization
• Various 3D visualization techniques were described and a still unresolved question was posed - Can 3D visualization become a commodity as a service? Transferring all the data to the client is not feasible; however, advances are being made that allow significant science to be done with client-side scripting and slices/subsets of data. Such applications are nevertheless difficult to customize.
• No one protocol or standard allows for the transmission of both image data and provenance. The two are sent over different channels, which makes it difficult for the client application to combine and make effective use of all the information. Integrating the many facets of data visualization and manipulation is still a challenging problem.
• Many software applications attempt to be complete solutions. An argument was made that software reuse would be better facilitated by modularizing plotting/visualization components.
• Another discussion focused on the need for all data sets to have unique IDs so that they can be cited and their provenance can be tracked. Within the visualization context this led to comments that visualization should/will transition to plotting by ID and not by URL, as is common today. Systems will need to make use of infrastructure that maps IDs to one or more physical locations.
• Data Mining
• Uniqueness of instruments is a common difficulty when mining across several data sets that appear to be similar.
• Oct 5-7 NASA/AMES Conference on Intelligent Data Understanding
• Outlier detection can reveal
• Novelty - new event or feature
• Anomaly - error in the processing algorithm
• Quality Assurance - is the data system pipeline working
• Semantic Web
• The semantic web is focused on the meaning of data and the relationships between various data sets. While semantic web applications are used in a number of places, there is very little linking between data sets and across disciplines. Is this correlated with how funding is given?
• An ontology can be used to drive the development of an information system from the generation of standards documents to the production of XML data annotation and semantic query models.
• Outstanding issues within the semantic web
• performance is slow on desktop computers and traditional servers as one approaches large triple counts
• how to extract domain knowledge from scientists is still an open issue. It is a difficult and tedious process
• where is all the linking of data?
• Governance
• Informatics is a new field and a number of parallel efforts are underway to create governance
• In some respects we are competing with each other for resources
• Examples of governance organizations include NGC (NSF) and ESIP (NASA)
• Examples of grass roots efforts in space physics modeling were given and it was emphasized that these groups are successful because they are very focused and members have a sense of ownership.
• There was a discussion of whether a new organization is needed or if one of the existing groups (ESSI or ESIP) could be extended/augmented to address the growing informatics field. There were good arguments on both sides and no clear winner.
• ESIP has an earth science membership and mandate. Some thought ESIP would be receptive to space physics while others did not.
• AGU is starting to see informatics as its own field; however, many felt it would always be secondary to other AGU sections. We end up talking to ourselves at AGU sessions, and the size and scope of the meeting make it difficult to attract domain scientists. Others disagreed and felt AGU was the ideal place for informatics/domain science interaction - forming yet another group with yet another meeting will only dilute our efforts.
• domain science and technology have to be co-designed
• cross-disciplinary work is difficult because we don't know where all the commonalities are
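
The plot-by-ID idea noted under Visualization above depends on some service that maps a data set ID to one or more physical locations. A minimal Python sketch of such a mapping follows. The class name `DatasetRegistry`, the ID string, and the `example.org` URLs are all hypothetical and were not presented at the workshop; a real system would presumably sit on top of an existing persistent-identifier infrastructure.

```python
from typing import Dict, List


class DatasetRegistry:
    """Hypothetical in-memory map from a data set ID to its known locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, dataset_id: str, url: str) -> None:
        """Record one more physical location for a data set ID."""
        urls = self._locations.setdefault(dataset_id, [])
        if url not in urls:  # ignore duplicate registrations
            urls.append(url)

    def resolve(self, dataset_id: str) -> List[str]:
        """Return every known location for an ID, in registration order."""
        if dataset_id not in self._locations:
            raise KeyError(f"unknown data set ID: {dataset_id}")
        return list(self._locations[dataset_id])


# Illustrative usage: one ID, two mirrors (both URLs are made up).
registry = DatasetRegistry()
registry.register("demo:dataset-001", "http://example.org/archive/dataset-001.cdf")
registry.register("demo:dataset-001", "http://mirror.example.org/dataset-001.cdf")
locations = registry.resolve("demo:dataset-001")
```

A plotting client written this way would cite and request `demo:dataset-001` and let the registry decide which copy to fetch, so a moved file does not break the citation.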

# 2. Logistics

## 2.1. Getting here

• The meeting is located in the Johnson Center DH Gold Room. (On Tuesday we will meet in 3rd Floor Meeting Rm F; I will put a sign on the Gold Room's door pointing to the Tuesday-only room.) The Johnson Center has many restaurants.
• Campus map showing the Johnson Center: http://eagle.gmu.edu/map/fairfax.php?building=johnson_center
• Directions to George Mason: http://www.gmu.edu/resources/welcome/Directions-to-GMU.html (GMU's address is 4400 University Drive, Fairfax, VA 22030)
• To get to Mason from the subway, take the Orange Line to the last stop. You may take a cab ($15) or a bus ($2.0) [1]. The bus will take about 30 minutes and the cab about 15 minutes. It takes about 45 minutes on the subway from Metro Station to Fairfax.

## 2.2. Parking here

Please park in Lot K shown on this map: http://parking.gmu.edu/pdf%20files/parkingmap09.pdf

The meeting is located in the Johnson Center DH Gold Room (Building 30 on the parking map).

You may pick up your parking permit in building 44 on the map. The building opens at 8:30am. Tell them you are with the ESSI/Weigel conference. Please let rweigel@gmu.edu or Tom Narock know that you will need a parking pass.

## 2.3. Eating Here

Suggested Locations: [2]

essi, mSn2cw44

## 2.5. Misc.

The room will be open from 8am - 7pm.

On Tuesday, the meeting is in JC 3rd Floor Meeting Rm F.

# 3. Format of Meeting

This will be an "active" conference, which means that the audience is expected to participate during about half of the time instead of watching presentations for the entire day. The idea of the open discussion is that it should be like a conversation with a few colleagues at a table at a meeting - you discuss an issue and possibly show a few documents or figures for support. You do not give a formal "stand-up" presentation, but rather exchange ideas and information informally.

Two recent astroinformatics workshops have been (or will be) handled in a similar way; this format is much more stimulating and productive than a formal agenda of speakers.

Assignments for people attending the meeting:

• Before the meeting, put links up to documents or web pages that may be of interest in the relevant section of #Topics
• Prepare a 5-15 minute presentation containing remarks about your view of the session topic
• Solicit links for documents and web pages from colleagues that you know will be attending
• Please be aware that the audience will consist of scientists and software developers from the Earth Sciences, Space Sciences, Astronomy, and Astroinformatics. At the very least, be careful about jargon.
• During the meeting, please ask someone to keep and post notes about questions asked to the speaker or discussions during the open forum.

Audience

• Post to the wiki (#Posting_and_Uploading_Files), or send to the session leader, links to documents and web pages that you may want to talk about during the discussion session (2-5 minutes max)
• Do the same for powerpoints with slides that you may want to show during the open discussion.
• Post comments and notes on the wiki during and after the meeting

In preparation for the open discussion, please post links, comments, and a few slides that you would like to show on the wiki under the appropriate section:

To upload a file, edit the page and enter text similar to this: [[Image:My_File_Name.ppt|Description of My File]]. When you select save, you will see a red link such as Image:My File Name.ppt. Selecting the red link takes you to an upload page, or to a login page if you have not created an account. If you do not want to create an account, use username=ESSI2010 and password=2010ESSI.

# 5. Schedule

Monday

• 9:00 - 9:15 Introduction/Logistics
• 9:15 - 9:30 Software Reuse/Open Development (Overview) - Bob Weigel
• 9:30 - 9:45 From Data to Knowledge - Weiss, et al.
• 9:45 - 10:00 Levels of Service - Ruth Duerr
• 10:00 - 10:15 Untitled presentation - Joe Hourclé
• 10:15 - 10:30 open discussion
• 10:30 - 10:45 Clouds, Grids and other Computational Frameworks for Science - Toews, et al.
• 10:45 - 11:00 Tools for Reusing Earth Science Software - Downs, et al.
• 11:00 - 11:15 FAIR-TRADE: A Framework to Share Digital and Computational Resources - King, et al.
• 11:15 - 12:00 open discussion/break
• 12:00 - 1:30 Lunch
• 1:30 - 1:45 An overview of the PDS 2010 infrastructure - Hughes
• 1:45 - 2:00 Fundamental Data Standards for Interoperability - Hughes
• 2:00 - 3:15 Software Open Discussion
• 3:15 - 3:30 Break
• 3:30 - 3:45 Provenance (Overview) - Curt Tilmes
• 3:45 - 4:00 Gregory Leptoukh
• 4:00 - 5:15 Provenance open discussion

Tuesday (On Tuesday, the meeting is in JC 3rd Floor Meeting Rm F.)

• 9:00 - 9:15 Visualization (Overview) - Mike Wiltberger
• 9:15 - 9:30 Gregory Leptoukh
• 9:30 - 9:45 Jeremy Faden
• 9:45 - 10:00 Asher Pembroke
• 10:00 - 11:15 Visualization Open Discussion
• 11:15 - 11:30 Break
• 11:30 - 11:45 Data Mining, Machine Learning, and Data Analysis (Overview) - Guido Cervone
• 11:45 - 12:00 Kirk Borne
• 12:00 - 1:30 Lunch
• 1:30 - 1:45 Data mining for source detection of atmospheric pollutants - Guido Cervone
• 1:45 - 2:45 Data Mining open discussion
• 2:45 - 3:00 Break
• 3:00 - 3:15 Semantic Web (Overview) - Chris Lynnes
• 3:15 - 3:30 Development of the Next Generation PDS Data Standards - J. Steven Hughes and Daniel Crichton
• 3:30 - 3:45 Gregory Leptoukh
• 3:45 - 4:45 Semantic Web Open Discussion

Wednesday

• 9:00 - 9:15 Community organization and governance (Overview) - Kerstin Lehnert
• 9:15 - 9:30 Toward Global Implementation of the International GeoSample Number IGSN - Kerstin Lehnert, Jens Klump, Celine Chan
• 9:30 - 9:45 Sustainable Governance for Long-Term Stewardship of Earth Science Data - Robert R. Downs and Robert S. Chen
• 9:45 - 10:00 GeoPass - Single sign-on for distributed Geoinformatics systems - Celine Chan, Dion Ridley, Kerstin Lehnert
• 10:00 - 11:15 Governance Open Discussion
• 11:15 - 11:30 Break
• 11:30 - 11:45 Informatics in Education - An Education in Informatics - Kirk Borne
• 11:45 - 12:30 Education open discussion

# 6. Topics

## 6.1. Software Reuse/Open Development

### 6.1.1. Weigel

Software Reuse/Open Development (Overview)

Went over and discussed a few items given at [3]. General links: [4] [5]

### 6.1.6. Toews

Clouds, Grids and other Computational Frameworks for Science

### 6.1.10. Other Notes

• Paper based on presentation in 2009 meeting: "Relevance of software reuse in building advanced scientific data processing systems" [6]
• In my opinion (Tom Narock), informatics research (especially software development) should follow the framework outlined in these papers:
• March, S. T. and Smith, G. F., "Design and Natural Science Research on Information Technology," Decision Support Systems, vol 15, no 4 (1995), pp 251-266. [7]
• Hevner, A. R. and March, S. T., "The Information Systems Research Cycle," IEEE Computer, November 2003, pp 109-111. [8]
• Hevner, A., March, S. T., Park, J. and Ram, S. "Design Science Research in Information Systems," MIS Quarterly (28:1) March 2004, pp. 75-105.[9]
• Example collaborative workspace from BioInformatics world: myExperiment
• Collection of papers about scientific collaboration: Scientific Collaboration on the Internet
• Information science studies have referred to virtual organizations for science research as 'collaboratories', and there have been a number of articles on the subject since the mid-1990s (see the Scientific Collaboration book above for its bibliography)

## 6.2. Provenance

• A good introduction to aspects of provenance affecting Earth science data: [10]
• The same data are commonly accessible through a number of web-based tools. How does one ensure that proper attribution is given to the originator of the data?
• Does one cite all tools and methods used to access data? If yes, are there standards for such attribution?
• How does one ensure that two electronic versions of data are indeed the same?
• Is complete reproducibility of a study feasible in the digital age?
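
One common partial answer to the question above about whether two electronic versions of data are the same is to compare cryptographic checksums. The Python sketch below uses SHA-256 from the standard library. The function names `file_digest` and `same_bytes` are illustrative only, and the approach is an assumption rather than something agreed at the workshop.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """True when the two files are byte-for-byte identical (by digest)."""
    return file_digest(path_a) == file_digest(path_b)
```

This only detects byte-level identity. Two files that hold the same values in different formats, or with different metadata headers, will compare as different, so a checksum does not settle whether two versions are scientifically equivalent.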

## 6.3. Visualization

### 6.3.2. Leptoukh

Lessons learned and perspectives gained while developing Giovanni [11]

• Package versus API
• Using just the visualization part
• The more powerful, the less control - need to find middle ground
• Adjusting Look -> publication quality?
• Google API not suitable for science
• Use inside workflow
• Visualization vs. WMS (+provenance)
• REST protocol for visualization
• 3D interactive vs. performance

Question to audience - "Other protocols for serving scientific plots?"

Discussion - One problem is that "WMS is provenance hostile" because provenance is not part of the protocol. It could be included in the metadata, but clients don't handle it. Standards organizations need to be persuaded to put such things into the standards. One audience member noted that including both provenance and visualization is difficult because of "real estate": it is hard to show both kinds of information at the same time.

### 6.3.5. General Questions

• What is the best way to make web-ready movies of static png images?
• If you were taking an undergraduate course on scientific visualization, what software would you want to use?
• What are a few visualization-related topics that you need to teach new interns or student researchers?

Links shown in the open discussion

## 6.4. Education

Citizen Science:

Science Education:

## 6.6. Semantic Web

• What semantic web applications are available and what have we learned from this new technology?
• Will semantic web applications eventually be commonplace within informatics? Or is there a specific niche for these tools?

### 6.6.4. King

Using Linked Data to create a semantic web? (Todd King) ppt

## 6.7. Data Mining

Scientific Data Mining:

Science Informatics: