Paper | Title | Page |
---|---|---|
WEBHAUST01 | LHCb Online Infrastructure Monitoring Tools | 618 |
The Online System of the LHCb experiment at CERN is composed of a very large number of PCs: around 1500 in a CPU farm for performing the High Level Trigger; around 170 for the control system, running the SCADA system PVSS; and several others for data monitoring, reconstruction, storage, and infrastructure tasks such as databases. Some PCs run Linux and some run Windows, but all of them need to be remotely controlled and monitored to make sure they are running correctly and to be able, for example, to reboot them whenever necessary. A set of tools was developed to centrally monitor the status of all PCs and PVSS projects needed to run the experiment: a Farm Monitoring and Control (FMC) tool, which provides the lower-level access to the PCs, and a System Overview Tool (developed within the Joint Controls Project, JCOP), which provides a centralized interface to the FMC tool and adds PVSS project monitoring and control. The implementation of these tools has provided a reliable and efficient way to manage the system, both during normal operations and during shutdowns, upgrades or maintenance. This paper presents the particular implementation of these tools in the LHCb experiment and the benefits of their usage in a large-scale heterogeneous system.
Slides WEBHAUST01 [3.211 MB]
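As a rough illustration of the kind of low-level reachability check that a central monitoring layer such as the FMC tool described in WEBHAUST01 aggregates, the following minimal Python sketch polls a list of farm nodes with a single ICMP ping each. The hostnames are hypothetical placeholders, and the real tools rely on far richer sensors than a simple ping.

```python
#!/usr/bin/env python
# Minimal sketch (not the actual FMC/JCOP code): poll a list of farm nodes
# with ICMP ping and print a one-line status summary per host.
# Hostnames below are hypothetical placeholders.
import subprocess

NODES = ["hlt-node001", "hlt-node002", "ctrl-pvss01"]  # hypothetical hosts

def is_alive(host, timeout_s=2):
    """Return True if the host answers a single ping within the timeout."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    for node in NODES:
        print("%-15s %s" % (node, "UP" if is_alive(node) else "DOWN"))
```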
WEBHAUST02 | Optimizing Infrastructure for Software Testing Using Virtualization | 622 |
Virtualization technology and cloud computing have brought a paradigm shift in the way we utilize, deploy and manage computer resources. They allow fast deployment of multiple operating systems as containers on physical machines, which can either be discarded after use or snapshotted for later re-deployment. At CERN, we have been using virtualization and cloud computing to quickly set up virtual machines with pre-configured software for our developers, enabling them to test and deploy new versions of a software patch for a given application. We have also been using the infrastructure for security analysis of control systems, since virtualization provides a degree of isolation in which control systems such as SCADA systems can be evaluated against simulated network attacks. This paper reports on the techniques used for security analysis, involving network configuration and isolation to prevent interference from other systems on the network, and provides an overview of the technologies used to deploy such an infrastructure based on VMware and the OpenNebula cloud management platform.
Slides WEBHAUST02 [2.899 MB]
WEBHAUST03 | Large-bandwidth Data Acquisition Network for XFEL Facility, SACLA | 626 |
We have developed a large-bandwidth data acquisition (DAQ) network for user experiments at the SPring-8 Angstrom Compact Free Electron Laser (SACLA) facility. The network connects detectors, on-line visualization terminals and the high-speed storage of the control and DAQ system to transfer beam diagnostic data of each X-ray pulse as well as the experimental data. The development of the DAQ network system (DAQ-LAN) was one of the critical elements of the system development, because data at transfer rates reaching 5 Gbps must be stored and visualized with high availability. DAQ-LAN is also used for instrument control. In order to guarantee both high-speed data transfer and instrument control, we have implemented a physically and logically separated network system. The DAQ-LAN currently consists of six 10-GbE capable network switches used exclusively for data transfer, and ten 1-GbE capable network switches for instrument control and on-line visualization. High availability is achieved by link aggregation (LAG) with a typical convergence time of 500 ms, which is faster than RSTP (2 s). To prevent network trouble caused by broadcasts, DAQ-LAN is logically separated into twelve network segments. The logical network segmentation is based on DAQ applications such as data transfer, on-line visualization, and instrument control. The DAQ-LAN will connect the control and DAQ system to the on-site high-performance computing system and to the next-generation supercomputers in Japan, including the K computer, for instant data mining during beamtime and for post analysis.
Slides WEBHAUST03 [5.795 MB]
WEBHAUST04 | A Virtualized Computing Platform For Fusion Control Systems | |
Funding: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
The National Ignition Facility (NIF) at the Lawrence Livermore National Laboratory is a stadium-sized facility that contains a 192-beam, 1.8-Megajoule, 500-Terawatt, UV laser system together with a 10-meter diameter target chamber with room for multiple experimental diagnostics. NIF is the world's largest and most energetic laser experimental system, providing a scientific center to study inertial confinement fusion (ICF) and matter at extreme energy densities and pressures. NIF's laser beams are designed to compress fusion targets to the conditions required for thermonuclear burn, liberating more energy than required to initiate the fusion reactions. 2,500 servers, 400 network devices and 700 terabytes of network-attached storage provide the foundation for NIF's Integrated Computer Control System (ICCS) and Experimental Data Archive. This talk discusses the rationale and benefits of server virtualization in the context of an operational experimental facility, the requirements discovery process used by the NIF teams to establish evaluation criteria for virtualization alternatives, the processes and procedures defined to enable virtualization of servers in a timeframe that did not delay the execution of experimental campaigns, and the lessons the NIF teams learned along the way. The virtualization architecture ultimately selected for ICCS is based on the open-source Xen computing platform and 802.1Q open networking standards. The specific server and network configurations needed to ensure performance and high availability of the control system infrastructure will be discussed. LLNL-CONF-477653
Slides WEBHAUST04 [2.201 MB]
WEBHAUST05 | Distributed System and Network Performance Monitoring | |
A robust and reliable network and system infrastructure is vital for successful operations at the NSLS-II (National Synchrotron Light Source II). A key component is a monitoring solution that can provide system information in real time for fault detection and problem isolation. Furthermore, this information must be archived for historical trending and post-mortem analysis. With 200+ network switches and dozens of servers comprising our control system, what tools should be selected to monitor system vitals, visualize network utilization, and parse copious syslog files? How can we track latency on a large network and decompose traffic flows to better optimize the configuration? This work examines both open-source and proprietary tools used in the controls group for distributed monitoring, such as Splunk, Nagios, SNMP, sFlow, Brocade Network Advisor, and the perfSONAR Performance Toolkit. We also describe how these elements are integrated into a cohesive platform.
Slides WEBHAUST05 [1.346 MB]
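A minimal sketch of the kind of SNMP polling that tools like those surveyed in WEBHAUST05 build upon: it samples an interface octet counter twice with the Net-SNMP snmpget command-line tool and derives a rough utilization estimate. The switch hostname, community string and interface index are hypothetical placeholders.

```python
#!/usr/bin/env python
# Sketch only: estimate inbound traffic on one switch port by sampling the
# IF-MIB ifInOctets counter twice with the Net-SNMP "snmpget" CLI.
# Host, community and ifIndex are hypothetical placeholders.
import subprocess
import time

HOST = "sw-core-01"                       # hypothetical switch
COMMUNITY = "public"                      # read-only community string
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10.1"   # ifInOctets for ifIndex 1

def get_counter(oid):
    """Return an integer counter value fetched with snmpget."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid])
    return int(out.split()[0])

if __name__ == "__main__":
    interval = 10.0
    first = get_counter(IF_IN_OCTETS)
    time.sleep(interval)
    second = get_counter(IF_IN_OCTETS)
    bps = (second - first) * 8 / interval   # ignores counter wrap-around
    print("inbound rate on %s: %.1f Mbit/s" % (HOST, bps / 1e6))
```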
WEBHAUST06 | Virtualized High Performance Computing Infrastructure of Novosibirsk Scientific Center | 630 |
Novosibirsk Scientific Center (NSC), also known worldwide as Akademgorodok, is one of the largest Russian scientific centers, hosting Novosibirsk State University (NSU) and more than 35 research organizations of the Siberian Branch of the Russian Academy of Sciences, including the Budker Institute of Nuclear Physics (BINP), the Institute of Computational Technologies, and the Institute of Computational Mathematics and Mathematical Geophysics (ICM&MG). Since each institute has specific requirements on the architecture of the computing farms involved in its research field, several computing facilities are currently hosted by NSC institutes, each optimized for a particular set of tasks; the largest of these are the NSU Supercomputer Center, the Siberian Supercomputer Center (ICM&MG), and the Grid Computing Facility of BINP. A dedicated optical network with an initial bandwidth of 10 Gbps connecting these three facilities was built in order to make it possible to share computing resources among the research communities, thus increasing the efficiency of operating the existing computing facilities and offering a common platform for building the computing infrastructure of future scientific projects. Unification of the computing infrastructure is achieved by extensive use of virtualization technology based on the XEN and KVM platforms. Our contribution gives a thorough review of the present status and future development prospects of the NSC virtualized computing infrastructure, focusing on its applications for handling everyday data processing tasks of HEP experiments carried out at BINP.
Slides WEBHAUST06 [14.369 MB]
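As a rough illustration of how a mixed XEN/KVM pool such as the one described in WEBHAUST06 could be inventoried programmatically, the sketch below uses the libvirt Python bindings to list running domains on a set of hypervisor hosts. The connection URIs are hypothetical, and the actual NSC infrastructure may use entirely different tooling.

```python
#!/usr/bin/env python
# Sketch only (hypothetical hosts, not the actual NSC tooling): list running
# virtual machines on a mixed Xen/KVM pool via the libvirt Python bindings.
import libvirt  # requires the libvirt-python package

# Hypothetical hypervisor connection URIs (one Xen host, one KVM host).
URIS = ["xen+ssh://hv-xen01/", "qemu+ssh://hv-kvm01/system"]

def list_domains(uri):
    """Yield (name, memory_MB, vcpus) for every running domain at the URI."""
    conn = libvirt.openReadOnly(uri)
    try:
        for dom_id in conn.listDomainsID():
            dom = conn.lookupByID(dom_id)
            state, max_mem_kb, _mem_kb, vcpus, _cpu_time = dom.info()
            yield dom.name(), max_mem_kb // 1024, vcpus
    finally:
        conn.close()

if __name__ == "__main__":
    for uri in URIS:
        print("== %s ==" % uri)
        for name, mem_mb, vcpus in list_domains(uri):
            print("  %-20s %5d MB  %d vCPU" % (name, mem_mb, vcpus))
```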
WEMMU005 | Fabric Management with Diskless Servers and Quattor on LHCb | 691 |
Large scientific experiments nowadays often use large computer farms to process the events acquired from the detectors. In LHCb, a small sysadmin team manages not only the 1400 servers of the LHCb Event Filter Farm, but also a wide variety of control servers for the detector electronics and infrastructure computers: file servers, gateways, DNS, DHCP and others. This variety of servers could not be handled without a solid fabric management system; we chose the Quattor toolkit for this task. We present our use of this toolkit, with an emphasis on how we handle our diskless nodes (Event Filter Farm nodes and computers embedded in the acquisition electronics cards). We also show our current tests to replace the standard (RedHat/Scientific Linux) way of handling diskless nodes with fusion filesystems, and how this improves fabric management.
Slides WEMMU005 [0.119 MB]
Poster WEMMU005 [0.602 MB]
WEMMU006 | Management Tools for Distributed Control System in KSTAR | 694 |
The integrated control system of the Korea Superconducting Tokamak Advanced Research (KSTAR) device has been developed as a set of distributed control systems based on the Experimental Physics and Industrial Control System (EPICS). It has the essential role of remote operation, supervision of the tokamak device and conduct of plasma experiments without any interruption; the availability of the control system therefore directly impacts the performance of the entire device. For non-interrupted operation of the KSTAR control system, we have developed a tool named Control System Monitoring (CSM) to monitor, in real time, the resources of EPICS Input/Output Controller (IOC) servers (utilization of memory, CPU, disk, network, user-defined processes and system-defined processes), the soundness of storage systems (storage utilization and status), the status of network switches using the Simple Network Management Protocol (SNMP), the network connection status of every local control server using the Internet Control Message Protocol (ICMP), and the operating environment of the main control room and the computer room (temperature, humidity, water leaks). When abnormal conditions or faults are detected, the CSM raises alarms to the operators; in particular, if a critical fault related to data storage occurs, the CSM sends short messages to the operators' mobile phones. In addition to the CSM, other tools used to manage the integrated control system for KSTAR operation will be introduced: Subversion for software version control and VMware for the virtualized IT infrastructure.
Slides WEMMU006 [0.247 MB]
Poster WEMMU006 [5.611 MB]
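A minimal sketch of the kind of per-server resource check a tool like the KSTAR CSM performs, reading memory and disk utilization from standard Linux interfaces; the thresholds and the alarm print-out are purely illustrative and not taken from the actual system.

```python
#!/usr/bin/env python
# Sketch only: sample memory and disk utilization on a Linux IOC server and
# flag values above illustrative thresholds (not the actual KSTAR CSM code).
import os

MEM_ALARM_PCT = 90.0    # illustrative thresholds
DISK_ALARM_PCT = 90.0

def memory_used_pct():
    """Percentage of physical memory in use, from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])   # values are in kB
    total = info["MemTotal"]
    free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return 100.0 * (total - free) / total

def disk_used_pct(path="/"):
    """Percentage of the filesystem at 'path' that is in use."""
    st = os.statvfs(path)
    return 100.0 * (1.0 - float(st.f_bavail) / st.f_blocks)

if __name__ == "__main__":
    mem, disk = memory_used_pct(), disk_used_pct("/")
    print("memory: %.1f%%  disk(/): %.1f%%" % (mem, disk))
    if mem > MEM_ALARM_PCT or disk > DISK_ALARM_PCT:
        print("ALARM: resource utilization above threshold")
```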
WEPMU030 | CERN Safety System Monitoring - SSM | 1134 |
CERN SSM (Safety System Monitoring) is a system for monitoring the state of health of the various access and safety systems of the CERN site and accelerator infrastructure. The emphasis of SSM is on the needs of maintenance and system operation, with the aim of providing an independent and reliable verification path for the basic operational parameters of each system. Included are all network-connected devices, such as PLCs, servers, panel displays, operator posts, etc. The basic monitoring engine of SSM is the freely available system-monitoring framework Zabbix, on top of which a simplified traffic-light-style web interface has been built. The web interface of SSM is designed to be ultra-light to facilitate access from handheld devices over slow connections. The underlying Zabbix system offers the history and notification mechanisms typical of advanced monitoring systems.
Poster WEPMU030 [1.231 MB]
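A minimal sketch of how a traffic-light summary could be derived from Zabbix, assuming a JSON-RPC API with user.login and trigger.get methods of roughly the vintage in use at the time; the URL and credentials are placeholders, and the real SSM front end is considerably more elaborate.

```python
#!/usr/bin/env python
# Sketch only: query active Zabbix triggers over its JSON-RPC API and reduce
# them to a single traffic-light colour, roughly in the spirit of the SSM
# web interface. URL and credentials are hypothetical placeholders.
import json
import urllib.request

ZABBIX_URL = "http://zabbix.example.org/api_jsonrpc.php"   # placeholder
USER, PASSWORD = "monitor", "secret"                       # placeholders

def rpc(method, params, auth=None, req_id=1):
    """Perform one JSON-RPC 2.0 call against the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params,
               "auth": auth, "id": req_id}
    req = urllib.request.Request(
        ZABBIX_URL, json.dumps(payload).encode(),
        headers={"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())["result"]

if __name__ == "__main__":
    token = rpc("user.login", {"user": USER, "password": PASSWORD})
    # Triggers currently in PROBLEM state (value = 1).
    problems = rpc("trigger.get",
                   {"output": ["description", "priority"],
                    "filter": {"value": 1}}, auth=token)
    worst = max((int(t["priority"]) for t in problems), default=0)
    colour = "GREEN" if worst == 0 else ("YELLOW" if worst < 4 else "RED")
    print("%d active problems -> %s" % (len(problems), colour))
```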
WEPMU031 | Virtualization in Control System Environment | 1138 |
In a large-scale distributed control system, many common services make up the environment of the entire control system, such as the servers for the common software base library, the application servers, the archive server and so on. This paper describes a virtualization deployment for a control system environment, covering virtualization of the servers, storage, network and applications of the control system. With a virtualized instance of the EPICS-based control system environment built with VMware vSphere v4, we tested the full functionality of this environment in the SSRF control system, including the common servers for NFS, NIS, NTP, booting, and the EPICS base and extension library tools. We also virtualized application servers such as the archiver, the alarm server, the EPICS gateway and all of the network-based IOCs. In particular, we successfully tested high availability (HA) and VMotion for EPICS asynchronous IOCs under the different VLAN configurations of the current SSRF control system network.
WEPMU033 | Monitoring Control Applications at CERN | 1141 |
The Industrial Controls and Engineering (EN-ICE) group of the Engineering Department at CERN has produced, and is responsible for the operation of, around 60 applications which control critical processes in the domains of cryogenics, quench protection systems, power interlocks for the Large Hadron Collider and other sub-systems of the accelerator complex. These applications require 24/7 operation and a quick reaction to problems. For this reason EN-ICE is presently developing a monitoring tool to detect, anticipate and report possible anomalies in the integrity of the applications. The tool is built on top of the Simatic WinCC Open Architecture (formerly PVSS) SCADA system and makes use of the Joint COntrols Project (JCOP) and UNICOS frameworks developed at CERN. It provides centralized monitoring of the different elements that make up the control systems, such as Windows and Linux servers, PLCs, applications, etc. Although the primary aim of the tool is to assist the members of the EN-ICE Standby Service, it can present different levels of detail of the systems depending on the user, enabling experts to diagnose and troubleshoot problems. In this paper, the scope, functionality and architecture of the tool are presented and some initial results on its performance are summarized.
Poster WEPMU033 [1.719 MB]
WEPMU034 | Infrastructure of Taiwan Photon Source Control Network | 1145 |
A reliable, flexible and secure network is essential for the Taiwan Photon Source (TPS) control system, which is based upon the EPICS toolkit framework. Subsystem subnets will connect to the control system via EPICS-based CA gateways for forwarding data and reducing network traffic. Combining cyber-security technologies such as firewalls, NAT and VLANs, the control network is isolated to protect the IOCs and accelerator components. Network management tools are used to improve network performance, and a remote access mechanism will be provided for maintenance and troubleshooting. Ethernet is also used as a fieldbus for instruments such as power supplies. This paper describes the system architecture of the TPS control network; cabling topology, redundancy and maintainability are also discussed.
WEPMU035 | Distributed Monitoring System Based on ICINGA | 1149 |
The basic services of the large IT infrastructure of the LHCb experiment are monitored with ICINGA, a fork of the industry-standard monitoring software NAGIOS. The infrastructure includes thousands of servers and computers, storage devices, more than 200 network devices, many VLANs, databases, hundreds of diskless nodes and much more. The number of configuration files needed to control the whole installation is large, and there is a lot of duplication when the monitoring infrastructure is distributed over several servers. In order to ease the manipulation of the configuration files, we designed a monitoring schema particularly adapted to our network, taking advantage of its specificities, and developed a tool to centralize its configuration in a database. Thanks to this tool, we could also parse all our previous configuration files and thus fill in our Oracle database, which replaces the previous Active Directory-based solution. A web front end allows non-expert users to easily add new entities to monitor. We present the schema of our monitoring infrastructure and the tool used to manage and automatically generate the configuration for ICINGA.
Poster WEPMU035 [0.375 MB]
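A minimal sketch of how host definitions can be generated for ICINGA from a central database, in the spirit of the tool described in WEPMU035; the table, column and template names are hypothetical, and the LHCb tool (Oracle-backed, with a web front end) is considerably more complete.

```python
#!/usr/bin/env python
# Sketch only: generate ICINGA 1.x host definitions from rows stored in a
# central database. Table, column and template names are hypothetical; the
# production tool uses an Oracle schema and a web front end.
import sqlite3  # stand-in for the Oracle database used in production

HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""

def generate_config(db_path, out_path):
    """Write one 'define host' block per row of the monitored_hosts table."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name, address, template FROM monitored_hosts ORDER BY name")
    with open(out_path, "w") as out:
        for name, address, template in rows:
            out.write(HOST_TEMPLATE.format(
                name=name, address=address, template=template))
    conn.close()

if __name__ == "__main__":
    generate_config("monitoring.db", "hosts.cfg")
```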
WEPMU036 | Efficient Network Monitoring for Large Data Acquisition Systems | 1153 |
Though constantly evolving and improving, the available network monitoring solutions have limitations when applied to the infrastructure of a high-speed, real-time data acquisition (DAQ) system. DAQ networks are particular computer networks in which experts have to pay attention to both individual subsections and system-wide traffic flows while monitoring the network. The ATLAS network at the Large Hadron Collider (LHC) has more than 200 switches interconnecting 3500 hosts and totaling 8500 high-speed links. The use of heterogeneous tools for monitoring the various infrastructure parameters, in order to assure optimal DAQ system performance, proved to be a tedious and time-consuming task for experts. To alleviate this problem we used our networking and DAQ expertise to build a flexible and scalable monitoring system providing an intuitive user interface with the same look and feel irrespective of the data provider that is used. Our system uses custom-developed components for critical performance monitoring and seamlessly integrates complementary data from auxiliary tools such as NAGIOS, information services or custom databases. A number of techniques (e.g. normalization, aggregation and data caching) were used in order to improve the user interface response time. The end result is a unified monitoring interface, providing fast and uniform access to system statistics, which significantly reduces the time spent by experts on ad-hoc and post-mortem analysis.
Poster WEPMU036 [5.945 MB]
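A rough sketch of the data-caching technique mentioned in WEPMU036: results fetched from a slow data provider are kept for a short time so that repeated user-interface requests do not hit the backend again. This is a generic illustration with a hypothetical fetch function, not the ATLAS implementation.

```python
#!/usr/bin/env python
# Sketch only: a small time-based cache in front of a slow monitoring data
# provider, illustrating the caching technique mentioned in the abstract.
# The fetch function below is a hypothetical stand-in for a real backend.
import time

class TimedCache:
    """Cache fetch results for ttl seconds, keyed by an arbitrary name."""

    def __init__(self, fetch_func, ttl=30.0):
        self._fetch = fetch_func
        self._ttl = ttl
        self._store = {}            # key -> (timestamp, value)

    def get(self, key):
        now = time.time()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                       # fresh enough: serve cached
        value = self._fetch(key)                # otherwise query the backend
        self._store[key] = (now, value)
        return value

def fetch_link_utilization(link_name):
    """Hypothetical slow backend call (e.g. an SNMP or database query)."""
    time.sleep(0.5)                             # simulate backend latency
    return {"link": link_name, "utilization_pct": 42.0}

if __name__ == "__main__":
    cache = TimedCache(fetch_link_utilization, ttl=30.0)
    print(cache.get("sw01:port12"))             # slow: goes to the backend
    print(cache.get("sw01:port12"))             # fast: served from the cache
```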
WEPMU037 | Virtualization for the LHCb Experiment | 1157 |
The LHCb experiment, one of the four large particle physics detectors at CERN, counts in its Online System more than 2000 servers and embedded systems. As a result of the ever-increasing CPU performance of modern servers, many of the applications in the controls system are excellent candidates for virtualization technologies. We see virtualization as an approach to cut down cost, optimize resource usage and manage the complexity of the IT infrastructure of LHCb. Recently we have added a Kernel-based Virtual Machine (KVM) cluster based on Red Hat Enterprise Virtualization for Servers (RHEV), complementary to the existing Hyper-V cluster devoted exclusively to the virtualization of Windows guests. This paper describes the architecture of our solution based on KVM and RHEV, along with its integration with the existing Hyper-V infrastructure and the Quattor cluster management tools, and in particular how we use it to run controls applications on a virtualized infrastructure. We present performance results of both the KVM and Hyper-V solutions, the problems encountered, and a description of the management tools developed for the integration with the Online cluster and the LHCb SCADA control system based on PVSS.
WEPMU038 | Network Security System and Method for RIBF Control System | 1161 |
In the RIKEN RI Beam Factory (RIBF), the local area network for the accelerator control system (the control system network) consists of commercially produced Ethernet switches, optical fibers and metal cables. E-mail and Internet access for tasks unrelated to accelerator operation are provided by the RIKEN virtual LAN (VLAN), which serves as the office network. From the viewpoint of information security, we decided to separate the control system network from the Internet and operate it independently of the VLAN. However, this was inconvenient for users, because it was no longer possible to monitor the information and status of accelerator operation from the users' offices in real time. To improve this situation, we have constructed a secure system which allows users on the VLAN to obtain accelerator information from the control system network, while preventing outsiders from accessing it. To allow access from the VLAN into the control system network, we set up a reverse proxy server and a firewall. In addition, we implemented a system that sends E-mail security alerts from the control system network to the VLAN. In our contribution, we report on this system and its present status in detail.
Poster WEPMU038 [45.776 MB]
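A minimal sketch of the E-mail alert path described in WEPMU038, sending a security notification from a host on the control system network to an address reachable on the office network; the SMTP relay, addresses and message text are hypothetical placeholders.

```python
#!/usr/bin/env python
# Sketch only: send a security-alert e-mail from the control system network
# to the office network. Relay host, addresses and text are hypothetical.
import smtplib
from email.mime.text import MIMEText

SMTP_RELAY = "mailgw.example.org"            # hypothetical relay host
FROM_ADDR = "alert@control.example.org"      # placeholder sender
TO_ADDR = "operators@office.example.org"     # placeholder recipient

def send_alert(subject, body):
    """Compose a plain-text alert message and hand it to the SMTP relay."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    with smtplib.SMTP(SMTP_RELAY, 25, timeout=10) as server:
        server.sendmail(FROM_ADDR, [TO_ADDR], msg.as_string())

if __name__ == "__main__":
    send_alert("Security alert: control network",
               "Unexpected connection attempt detected on the control LAN.")
```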
WEPMU039 | Virtual IO Controllers at J-PARC MR using Xen | 1165 |
The control system for the J-PARC accelerator complex has been developed based on the EPICS toolkit. About 100 traditional ("real") VME-bus computers are used as EPICS IOCs in the control system for the J-PARC MR (Main Ring). Recently, we have introduced "virtual" IOCs using Xen, an open-source virtual machine monitor. Scientific Linux with an EPICS iocCore runs on each Xen virtual machine, and EPICS databases for network devices and EPICS soft records can be configured. Multiple virtual IOCs run on a high-performance blade-type server running Scientific Linux as the native OS. A few virtual IOCs have been demonstrated in MR operation since October 2010. Experience and future perspectives will be discussed.
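As a simple illustration of how the liveness of Xen-hosted virtual IOCs such as those in WEPMU039 could be checked from the dom0 host, the sketch below parses the output of the classic `xm list` command; the domain names are hypothetical, and the actual J-PARC operation tools are not described in the abstract.

```python
#!/usr/bin/env python
# Sketch only: verify from the Xen dom0 that a set of virtual-IOC domains is
# running, by parsing "xm list" output. Domain names are hypothetical.
import subprocess

EXPECTED_IOCS = ["vioc-mr01", "vioc-mr02"]   # hypothetical virtual IOCs

def running_domains():
    """Return the set of Xen domain names reported by 'xm list'."""
    out = subprocess.check_output(["xm", "list"]).decode()
    lines = out.strip().splitlines()[1:]     # skip the header line
    return {line.split()[0] for line in lines}

if __name__ == "__main__":
    up = running_domains()
    for ioc in EXPECTED_IOCS:
        print("%-12s %s" % (ioc, "running" if ioc in up else "MISSING"))
```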
WEPMU040 | Packaging of Control System Software | 1168 |
Funding: ITER, European Union, European Regional Development Fund and Republic of Slovenia, Ministry of Higher Education, Science and Technology.
Control system software consists of several parts: the core of the control system, drivers for the integration of devices, configuration for user interfaces, the alarm system, etc. Once the software is developed and configured, it must be installed on the computers where it runs. Usually it is installed on an operating system whose services it needs, and in some cases it also dynamically links against the libraries the operating system provides. The operating system can be quite complex itself: a typical Linux distribution, for example, consists of several thousand packages. To manage this complexity, we decided to rely on the Red Hat Package Manager (RPM) to package the control system software and to ensure it is properly installed (i.e., that dependencies are also installed and that scripts are run after installation if any additional actions need to be performed). As dozens of RPM packages need to be prepared, we reduce the amount of effort and improve consistency between packages through a Maven-based infrastructure that assists in packaging (e.g., automated generation of RPM SPEC files, including automated identification of dependencies). So far, we have used it to package EPICS, Control System Studio (CSS) and several device drivers. We perform extensive testing on Red Hat Enterprise Linux 5.5, but we have also verified that the packaging works on CentOS and Scientific Linux. In this article, we describe in greater detail the packaging system we are using and its particular application to the ITER CODAC Core System.
Poster WEPMU040 [0.740 MB]
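A minimal sketch of automated SPEC-file generation in the spirit of, but much simpler than, the Maven-based infrastructure described in WEPMU040; the package metadata below is hypothetical, and the real system also derives dependencies automatically.

```python
#!/usr/bin/env python
# Sketch only: emit a minimal RPM SPEC file from a small metadata dictionary,
# illustrating automated SPEC generation (the real system is Maven-based and
# far more complete). Package name, version and dependencies are hypothetical.

SPEC_TEMPLATE = """Name:           {name}
Version:        {version}
Release:        1%{{?dist}}
Summary:        {summary}
License:        {license}
BuildArch:      noarch
{requires}
%description
{summary}

%files
{files}

%changelog
"""

def make_spec(meta):
    """Render a SPEC file string from the given package metadata."""
    requires = "\n".join("Requires:       %s" % r for r in meta["requires"])
    files = "\n".join(meta["files"])
    return SPEC_TEMPLATE.format(name=meta["name"], version=meta["version"],
                                summary=meta["summary"],
                                license=meta["license"],
                                requires=requires, files=files)

if __name__ == "__main__":
    meta = {  # hypothetical example package
        "name": "example-ioc-config",
        "version": "1.0.0",
        "summary": "Configuration files for an example EPICS IOC",
        "license": "GPLv2",
        "requires": ["epics-base"],
        "files": ["/opt/example-ioc/db/example.db"],
    }
    with open("example-ioc-config.spec", "w") as f:
        f.write(make_spec(meta))
```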