THCHAU —  Data Management   (13-Oct-11   14:00—15:30)
Chair: A. Buteau, SOLEIL, Gif-sur-Yvette, France
Paper Title Page
PaN-data, the Photon and Neutron Open Data Infrastructure Project  
  • R.D. Dimper
    ESRF, Grenoble, France
  The PaN-data collaboration brings together some of the major multidisciplinary research infrastructures in Europe to construct and operate a sustainable data infrastructure for the European Neutron and Photon laboratories. Such a unique infrastructure will enhance all research done in this community by making data accessible, preserving the data, allowing experiments to be carried out jointly in several laboratories, and providing powerful tools for scientists to interact with the data remotely. The presentation will introduce the PaN-data FP7 project, highlight its potential benefits and then focus on the challenges which will be addressed in the months and years to come, such as the management of large data rates and data sets, standardized annotation of the data, transparent and secure remote access to data, and long-term data preservation.
Slides THCHAUST01 [10.301 MB]
THCHAUST02 Large Scale Data Facility for Data Intensive Synchrotron Beamlines 1216
  • R. Stotzka, A. Garcia, V. Hartmann, T. Jejkal, H. Pasic, A. Streit, J. van Wezel
    KIT, Karlsruhe, Germany
  • D. Haas, W. Mexner, T. dos Santos Rolo
    Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
  ANKA is a large-scale facility of the Helmholtz Association of National Research Centers in Germany, located at the Karlsruhe Institute of Technology. As a synchrotron light source, it provides light from hard X-rays to the far infrared for research and technology. It serves as a user facility for the national and international scientific community and currently produces 100 TB of data per year. Within the next two years, a couple of additional data-intensive beamlines will become operational, producing up to 1.6 PB per year. These amounts of data have to be stored and provided to users on demand. The Large Scale Data Facility (LSDF) is located on the same campus as ANKA. It is a data service facility dedicated to data-intensive scientific experiments. Currently, 4 PB of storage for unstructured and structured data and a Hadoop cluster as a computing resource for data-intensive applications are available. Within the campus, experiments and the main large data-producing facilities are connected via 10 GE network links. An additional 10 GE link connects to the internet. Tools for easy and transparent access allow scientists to use the LSDF without having to deal with its internal structures and technologies. Open interfaces and APIs support a variety of access methods to the highly available services for high-throughput data applications. In close cooperation with ANKA, the LSDF provides assistance in organizing data and metadata structures efficiently, and develops and deploys community-specific software running on the directly connected computing infrastructure.
Slides THCHAUST02 [1.294 MB]
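The data volumes quoted in the THCHAUST02 abstract can be put in relation to the 10 GE links with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes), a 365-day year and perfectly even data production, none of which the abstract states. Real beamlines produce data in bursts, so peak rates will be considerably higher than these averages.

```python
# Average sustained data rates implied by the yearly volumes in the abstract.
# Assumptions (not from the abstract): decimal units, 365-day year, even production.
SECONDS_PER_YEAR = 365 * 24 * 3600
LINK_GBIT_S = 10.0  # nominal capacity of one 10 GE link


def average_rate_gbit_s(bytes_per_year: float) -> float:
    """Convert a yearly data volume in bytes to an average rate in Gbit/s."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e9


current = average_rate_gbit_s(100e12)  # 100 TB per year
planned = average_rate_gbit_s(1.6e15)  # 1.6 PB per year

for label, rate in (("current", current), ("planned", planned)):
    share = 100 * rate / LINK_GBIT_S
    print(f"{label}: {rate:.3f} Gbit/s on average ({share:.1f}% of one 10 GE link)")
```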
THCHAUST03 Common Data Model ; A Unified Layer to Access Data from Data Analysis Point of View 1220
  • N. Hauser, T.K. Lam, N. Xiong
    ANSTO, Menai, Australia
  • A. Buteau, M. Ounsy, S. Poirier
    SOLEIL, Gif-sur-Yvette, France
  • C. Rodriguez
    ALTEN, Boulogne-Billancourt, France
  For almost 20 years, the scientific community of neutron and synchrotron facilities has been dreaming of a common data format for exchanging experimental results and the applications that analyse them. While using HDF5 as a physical container for data quickly gained broad consensus, the big issue remains the standardisation of data organisation. By introducing a new level of indirection for data access, the CommonDataModel (CDM) framework offers a solution and allows development efforts and responsibilities to be split between institutes. The CDM consists of a core API that accesses data through a data format plugin mechanism, together with scientific application definitions (i.e. sets of logically organised keywords defined by scientists for each experimental technique). Using an innovative "mapping" system between application definitions and physical data organisations, the CDM allows data reduction applications to be developed regardless of data file formats AND organisations. Each institute then has to develop data access plugins for its own file formats, along with the mapping between application definitions and its own data file organisation. Thus, data reduction applications can be developed from a strictly scientific point of view and are natively able to process data coming from several institutes. A concrete example of a SAXS data reduction application accessing NeXus and EDF (ESRF Data Format) files will be discussed.
Slides THCHAUST03 [36.889 MB]
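The indirection described in the THCHAUST03 abstract (application keywords, a per-institute mapping, and format plugins) can be illustrated with a small sketch. This is not the CDM API: every class, function, keyword and path below is hypothetical, and plain Python dictionaries stand in for real NeXus and EDF files.

```python
# Hypothetical illustration of keyword -> mapping -> plugin indirection.
# None of these names come from the CDM itself.


class NestedDictPlugin:
    """Stand-in for a hierarchical format (a NeXus/HDF5-like tree)."""

    def read(self, source: dict, physical_path: str):
        node = source
        for part in physical_path.strip("/").split("/"):
            node = node[part]
        return node


class FlatHeaderPlugin:
    """Stand-in for a flat key/value header format (EDF-like)."""

    def read(self, source: dict, physical_path: str):
        return source[physical_path]


class Dataset:
    """Resolves application-definition keywords through an institute's mapping."""

    def __init__(self, source: dict, plugin, mapping: dict):
        self._source = source
        self._plugin = plugin
        self._mapping = mapping  # keyword -> physical location in the file

    def get(self, keyword: str):
        return self._plugin.read(self._source, self._mapping[keyword])


def read_geometry(dataset: Dataset) -> tuple:
    """Application code: uses keywords only, never file formats or paths."""
    return dataset.get("wavelength"), dataset.get("detector_distance")


# Two institutes storing the same quantities in different organisations.
institute_a = Dataset(
    {"entry": {"instrument": {"wavelength": 1.0, "distance": 2.5}}},
    NestedDictPlugin(),
    {"wavelength": "/entry/instrument/wavelength",
     "detector_distance": "/entry/instrument/distance"},
)
institute_b = Dataset(
    {"WaveLength": 1.0, "SampleDistance": 2.5},
    FlatHeaderPlugin(),
    {"wavelength": "WaveLength", "detector_distance": "SampleDistance"},
)

print(read_geometry(institute_a), read_geometry(institute_b))
```

In this sketch, supporting a further institute would mean supplying a plugin and a mapping, while `read_geometry` stays unchanged.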
THCHAUST04 Management of Experiments and Data at the National Ignition Facility 1224
  • S.G. Azevedo, R.G. Beeler, R.C. Bettenhausen, E.J. Bond, A.D. Casey, H.C. Chandrasekaran, C.B. Foxworthy, M.S. Hutton, J.E. Krammen, J.A. Liebman, A.A. Marsh, T.M. Pannell, D.E. Speck, J.D. Tappero, A.L. Warrick
    LLNL, Livermore, California, USA
  Funding: This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Experiments, or "shots", conducted at the National Ignition Facility (NIF) are discrete events that occur over a very short time frame (tens of ns) separated by hours. Each shot is part of a larger campaign of shots to advance scientific understanding in high-energy-density physics. In one campaign, energy from the 192-beam, 1.8-Megajoule pulsed laser in NIF will be used to implode a hydrogen-filled target to demonstrate controlled fusion. Each shot generates gigabytes of data from over 30 diagnostics that measure optical, x-ray, and nuclear phenomena from the imploding target. Because of the low duty cycle of shots and the thousands of adjustments for each shot (target type, composition, shape; laser beams used, their power profiles, pointing; diagnostic systems used, their configuration, calibration, settings), it is imperative that we accurately define all equipment prior to the shot. Following the shot and the data acquisition by the automatic control system, it is equally imperative that we archive, analyze and visualize the results within the required 30 minutes post-shot. Results must be securely stored, approved, web-visible and downloadable in order to facilitate subsequent publication. To date, NIF has successfully fired over 2,500 system shots, along with thousands of test firings and dry runs. We will present an overview of the highly flexible and scalable campaign setup and management systems that control all aspects of the experimental NIF shot cycle, from configuration of drive lasers all the way through presentation of analyzed results.
Slides THCHAUST04 [5.650 MB]
THCHAUST05 LHCb Online Log Analysis and Maintenance System 1228
  • J.C. Garnier, L. Brarda, N. Neufeld, F. Nikolaidis
    CERN, Geneva, Switzerland
  History has shown many times that computer logs are the only information an administrator may have about an incident, which could be caused by either a malfunction or an attack. Due to the huge amount of logs produced by large-scale IT infrastructures such as LHCb Online, critical information may be overlooked or simply drowned in a sea of other messages. This clearly demonstrates the need for an automatic system for long-term maintenance and real-time analysis of the logs. We have constructed a low-cost, fault-tolerant centralized logging system which is able to do in-depth analysis and cross-correlation of every log. This system is capable of handling O(10000) different log sources and numerous formats, while trying to keep the overhead as low as possible. It provides log gathering and management, offline analysis and online analysis. We call offline analysis the procedure of analyzing old logs for critical information, while online analysis refers to the procedure of early alerting and reacting. The system is extensible and cooperates well with other applications such as Intrusion Detection / Prevention Systems. This paper presents the LHCb Online topology, the problems we had to overcome and our solutions. Special emphasis is given to log analysis, how we use it for monitoring, and how we can have uninterrupted access to the logs. We provide performance plots, code modifications in well-known log tools and our experience from trying various storage strategies.
Slides THCHAUST05 [0.377 MB]
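The distinction the THCHAUST05 abstract draws between offline analysis (searching old logs) and online analysis (early alerting) can be sketched as follows. This is a minimal illustration, not the LHCb system: the keyword pattern, the threshold, the time window and all names are assumptions chosen for the example.

```python
import re
from collections import defaultdict, deque

# Assumed pattern for "critical" messages; a real deployment would use its own rules.
CRITICAL = re.compile(r"\b(error|failed|denied|panic)\b", re.IGNORECASE)


def offline_scan(lines):
    """Offline analysis: search already stored log lines for critical messages."""
    return [line for line in lines if CRITICAL.search(line)]


class OnlineAnalyzer:
    """Online analysis: alert when one source emits too many critical
    messages within a sliding time window."""

    def __init__(self, threshold=3, window_s=60.0, on_alert=print):
        self.threshold = threshold
        self.window_s = window_s
        self.on_alert = on_alert
        self._hits = defaultdict(deque)  # source -> timestamps of critical messages

    def feed(self, timestamp, source, message):
        """Process one incoming message; return True if an alert was raised."""
        if not CRITICAL.search(message):
            return False
        hits = self._hits[source]
        hits.append(timestamp)
        # Drop hits that have fallen out of the time window.
        while hits and timestamp - hits[0] > self.window_s:
            hits.popleft()
        if len(hits) >= self.threshold:
            self.on_alert(f"{source}: {len(hits)} critical messages within {self.window_s:.0f} s")
            hits.clear()
            return True
        return False
```

Passing a callback as `on_alert` is one way such an analyzer could hand alerts to another application; the abstract does not say how the actual system integrates with intrusion detection tools.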
THCHAUST06 Instrumentation of the CERN Accelerator Logging Service: Ensuring Performance, Scalability, Maintenance and Diagnostics 1232
  • C. Roderick, R. Billen, D.D. Teixeira
    CERN, Geneva, Switzerland
  The CERN accelerator Logging Service currently holds more than 90 terabytes of data online, and processes approximately 450 gigabytes per day, via hundreds of data loading processes and data extraction requests. This service is mission-critical for day-to-day operations, especially with respect to the tracking of live data from the LHC beam and equipment. In order to effectively manage any service, the service provider's goals should include knowing how the underlying systems are being used, in terms of: "Who is doing what, from where, using which applications and methods, and how long each action takes". Armed with such information, it is then possible to: analyze and tune system performance over time; plan for scalability ahead of time; assess the impact of maintenance operations and infrastructure upgrades; diagnose past, ongoing, or recurring problems. The Logging Service is based on Oracle DBMS and Application Servers, and Java technology, and comprises several layered and multi-tiered systems. These systems have all been heavily instrumented to capture data about system usage, using technologies such as JMX. The success of the Logging Service and its proven ability to cope with ever-growing demands can be directly linked to the instrumentation in place. This paper describes the instrumentation that has been developed, and demonstrates how the instrumentation data is used to achieve the goals outlined above.
Slides THCHAUST06 [5.459 MB]
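The usage question quoted in the THCHAUST06 abstract ("who is doing what, from where, using which applications and methods, and how long each action takes") maps naturally onto one record per call. The sketch below shows the idea with a Python decorator. It is an assumption-laden illustration only: the Logging Service itself is built on Java, Oracle and JMX, and every name here is hypothetical.

```python
import functools
import time
from dataclasses import dataclass


@dataclass
class UsageRecord:
    user: str          # who
    host: str          # from where
    application: str   # using which application
    method: str        # doing what
    duration_s: float  # how long the action took


USAGE_LOG: list = []  # in-memory stand-in for a real instrumentation store


def instrumented(func):
    """Record caller context and duration for every call to func."""

    @functools.wraps(func)
    def wrapper(*args, user, host, application, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises, so failures are visible too.
            USAGE_LOG.append(
                UsageRecord(user, host, application, func.__name__,
                            time.perf_counter() - start)
            )

    return wrapper


@instrumented
def extract_data(variable, hours):
    """Hypothetical extraction request returning a placeholder result."""
    return f"{variable}:{hours}h"


result = extract_data("BEAM_INTENSITY", 2,
                      user="alice", host="console-01", application="viewer")
print(result, USAGE_LOG[-1])
```

Records of this kind could then be aggregated per user, application or method to support the tuning, capacity planning and diagnostics the abstract lists.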