Author: Neufeld, N.
Paper | Title | Page
MOBAUST06 The LHCb Experiment Control System: on the Path to Full Automation 20
  • C. Gaspar, F. Alessio, L.G. Cardoso, M. Frank, J.C. Garnier, R. Jacobsson, B. Jost, N. Neufeld, R. Schwemmer, E. van Herwijnen
    CERN, Geneva, Switzerland
  • O. Callot
    LAL, Orsay, France
  • B. Franek
    STFC/RAL, Chilton, Didcot, Oxon, United Kingdom
  LHCb is a large experiment at the LHC accelerator. The experiment control system is in charge of the configuration, control and monitoring of the different sub-detectors and of all areas of the online system: the Detector Control System (DCS), covering sub-detector voltages, cooling, temperatures, etc.; the Data Acquisition System (DAQ) and the Run-Control; the High Level Trigger (HLT), a farm of around 1500 PCs running trigger algorithms; etc. The building blocks of the control system are based on the PVSS SCADA system, complemented by a control framework developed in common for the four LHC experiments. This framework includes an "expert system"-like tool called SMI++, which we use for the system automation. The full control system runs distributed over around 160 PCs and is logically organised in a hierarchical structure, each level being capable of supervising and synchronizing the objects below. The experiment's operations are now almost completely automated, driven by a top-level object called Big-Brother, which pilots all the experiment's standard procedures and the most common error-recovery procedures. Examples of automated procedures are powering the detector, acting on the Run-Control (Start/Stop Run, etc.) and moving the vertex detector in/out of the beam, all driven by the state of the accelerator, as well as recovering from errors in the HLT farm. The architecture, tools and mechanisms used for the implementation, as well as some operational examples, will be shown.
Slides MOBAUST06 [1.451 MB]
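  To make the hierarchy concrete, the following minimal Python sketch mimics the behaviour described above: a parent control unit propagates commands to its children and derives its own state from theirs. It is an illustration only; the real system is written in SMI++'s own language on top of PVSS, and all class, state and object names here (including the toy Big-Brother stand-in) are invented.

    # Minimal sketch of a hierarchical finite-state-machine control tree,
    # in the spirit of the SMI++ object hierarchy described above.
    # All class, state and object names are illustrative, not the SMI++ API.

    class ControlUnit:
        def __init__(self, name, children=None):
            self.name = name
            self.children = children or []
            self.state = "NOT_READY"

        def command(self, action):
            """Propagate a command down the tree, then re-derive our state."""
            for child in self.children:
                child.command(action)
            if not self.children:                      # leaf: act on "hardware"
                self.state = {"CONFIGURE": "READY",
                              "START": "RUNNING",
                              "STOP": "READY"}.get(action, self.state)
            self.update_state()

        def update_state(self):
            """A parent's state is derived from the states of its children."""
            if self.children:
                states = {c.state for c in self.children}
                self.state = states.pop() if len(states) == 1 else "MIXED"

    # A toy top-level object supervising two sub-systems, Big-Brother style.
    dcs = ControlUnit("DCS")
    daq = ControlUnit("DAQ")
    top = ControlUnit("LHCb", [dcs, daq])
    top.command("CONFIGURE")
    print(top.state)   # READY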
MOPMN019 Controlling and Monitoring the Data Flow of the LHCb Read-out and DAQ Network 281
  • R. Schwemmer, C. Gaspar, N. Neufeld, D. Svantesson
    CERN, Geneva, Switzerland
  The LHCb readout uses a set of 320 FPGA-based boards as the interface between the on-detector hardware and the Gigabit Ethernet (GbE) DAQ network. The boards are the logical Level 1 (L1) read-out electronics and aggregate the experiment's raw data into event fragments that are sent to the DAQ network. To control the many parameters of the read-out boards, an embedded PC is included on each board, connecting to the board's ICs and FPGAs. The data from the L1 boards is sent through an aggregation network into the High Level Trigger farm. The farm comprises approximately 1500 PCs, which first assemble the fragments from the L1 boards and then perform a partial reconstruction and selection of the events. In total there are approximately 3500 network connections. Data is pushed through the network and there is no mechanism for resending packets. Loss of data on a small scale is acceptable, but care has to be taken to avoid it wherever possible. To monitor and debug losses, probes are inserted throughout the entire read-out chain to count fragments, packets and their rates at different positions. To keep uniformity throughout the experiment, all control software was developed using the common SCADA software, PVSS, with the JCOP framework as a base. The presentation will focus on the low-level controls interface developed for the L1 boards and the networking probes, as well as the integration of the high-level user interfaces into PVSS. We will show the way in which users and developers interact with the software, configure the hardware and follow the flow of data through the DAQ network.
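  As a rough illustration of how counters sampled at successive probe points can localize loss in a push-only network, consider this hypothetical Python sketch; the probe names and counter values are invented, and the real probes are integrated into PVSS as described above.

    # Hypothetical sketch: localize packet loss by differencing cumulative
    # fragment counters sampled at successive probe points along the
    # read-out chain. Probe locations and numbers are illustrative only.

    probes = [            # (probe location, fragments counted so far)
        ("L1-board-out",   1_000_000),
        ("aggregation-in", 1_000_000),
        ("aggregation-out",  999_950),
        ("farm-node-in",     999_950),
    ]

    def loss_per_hop(samples):
        """Yield (from, to, lost) for each consecutive pair of probes."""
        for (a, na), (b, nb) in zip(samples, samples[1:]):
            yield a, b, na - nb

    for src, dst, lost in loss_per_hop(probes):
        if lost:
            print(f"{lost} fragments lost between {src} and {dst}")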
WEBHAUST01 LHCb Online Infrastructure Monitoring Tools 618
  • L.G. Cardoso, C. Gaspar, C. Haen, N. Neufeld, F. Varela
    CERN, Geneva, Switzerland
  • D. Galli
    INFN-Bologna, Bologna, Italy
  The Online System of the LHCb experiment at CERN is composed of a very large number of PCs: around 1500 in a CPU farm performing the High Level Trigger; around 170 for the control system, running the PVSS SCADA system; and several more performing data monitoring, reconstruction, storage and infrastructure tasks such as databases. Some PCs run Linux and some run Windows, but all of them need to be remotely controlled and monitored to make sure they are running correctly and, for example, to allow rebooting them whenever necessary. A set of tools was developed in order to centrally monitor the status of all PCs and PVSS projects needed to run the experiment: a Farm Monitoring and Control (FMC) tool, which provides the lower-level access to the PCs, and a System Overview Tool (developed within the Joint Controls Project, JCOP), which provides a centralized interface to the FMC tool and adds PVSS project monitoring and control. The implementation of these tools has provided a reliable and efficient way to manage the system, both during normal operations and during shutdowns, upgrades or maintenance. This paper will present the particular implementation of these tools in the LHCb experiment and the benefits of their usage in a large-scale heterogeneous system.
Slides WEBHAUST01 [3.211 MB]
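  The following Python fragment sketches the general idea of centrally probing many PCs for liveness. It is not the FMC tool's actual mechanism (which provides far richer per-host monitoring and control); the host names and the use of a plain TCP connect as a health check are assumptions for illustration.

    # Illustrative sketch of centralized health monitoring: a supervisor
    # polls hosts and flags those that need intervention. The real system
    # uses the FMC tool and the JCOP System Overview Tool; everything
    # below is hypothetical.

    import socket

    def host_alive(host, port=22, timeout=2.0):
        """Crude liveness probe: can we open a TCP connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    farm = [f"hlt{n:04d}" for n in range(1, 4)]   # toy host list
    for host in farm:
        status = "OK" if host_alive(host) else "NEEDS ATTENTION (e.g. reboot)"
        print(f"{host}: {status}")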
WEMMU005 Fabric Management with Diskless Servers and Quattor on LHCb 691
  • P. Schweitzer, E. Bonaccorsi, L. Brarda, N. Neufeld
    CERN, Geneva, Switzerland
  Large scientific experiments nowadays very often use large computer farms to process the events acquired from the detectors. In LHCb a small sysadmin team manages the 1400 servers of the LHCb Event Filter Farm, but also a wide variety of control servers for the detector electronics and infrastructure computers: file servers, gateways, DNS, DHCP and others. This variety of servers could not be handled without a solid fabric management system. We chose the Quattor toolkit for this task. We will present our use of this toolkit, with an emphasis on how we handle our diskless nodes (Event Filter Farm nodes and computers embedded in the acquisition electronics cards). We will show our current tests to replace the standard (RedHat/Scientific Linux) way of handling diskless nodes with union filesystems, and how this improves fabric management.
Slides WEMMU005 [0.119 MB]
Poster WEMMU005 [0.602 MB]
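  To illustrate the union-filesystem idea for diskless nodes (a shared read-only root with a small per-node writable layer merged on top), here is a hedged Python sketch that builds an overlayfs mount command. The paths are invented, and the paper's tests may well have used a different union filesystem (overlayfs was not yet mainline in 2011), so this only demonstrates the concept.

    # Hedged sketch of the union-mount idea for diskless nodes: a
    # read-only OS image shared over the network, with a small writable
    # layer per node merged on top. All paths are illustrative.

    def overlay_mount_cmd(ro_image, rw_layer, workdir, mountpoint):
        """Build the mount command that merges a writable layer over a
        shared read-only root (standard Linux overlayfs syntax)."""
        return ("mount -t overlay overlay "
                f"-o lowerdir={ro_image},upperdir={rw_layer},workdir={workdir} "
                f"{mountpoint}")

    print(overlay_mount_cmd("/srv/images/slc-ro",    # shared read-only root
                            "/srv/rw/node0001",      # per-node writable layer
                            "/srv/rw/.work0001",     # overlayfs work dir
                            "/mnt/root"))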
WEPMU035 Distributed Monitoring System Based on ICINGA 1149
  • C. Haen, E. Bonaccorsi, N. Neufeld
    CERN, Geneva, Switzerland
  The basic services of the large IT infrastructure of the LHCb experiment are monitored with ICINGA, a fork of the industry-standard monitoring software NAGIOS. The infrastructure includes thousands of servers and computers, storage devices, more than 200 network devices, many VLANs, databases, hundreds of diskless nodes and much more. The number of configuration files needed to control the whole installation is large, and there is a lot of duplication when the monitoring infrastructure is distributed over several servers. In order to ease the manipulation of the configuration files, we designed a monitoring schema particularly adapted to our network, taking advantage of its specific features, and developed a tool to centralize its configuration in a database. Thanks to this tool, we could also parse all our previous configuration files and thus populate our Oracle database, which replaces the previous Active Directory based solution. A web frontend allows non-expert users to easily add new entities to monitor. We present the schema of our monitoring infrastructure and the tool used to manage and automatically generate the configuration for ICINGA.
Poster WEPMU035 [0.375 MB]
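  The heart of such a tool is turning database rows into Nagios-style configuration blocks. The Python sketch below shows that single step with an invented table layout and invented hosts; the production tool reads from Oracle and also covers services, groups and dependencies.

    # Minimal sketch of generating ICINGA (Nagios-style) configuration
    # from database rows, as the tool described above does. The row
    # layout and host data are invented for illustration.

    hosts = [   # rows as they might come from the database
        {"name": "hlt0001", "address": "10.128.1.1", "template": "generic-host"},
        {"name": "gw01",    "address": "10.128.0.1", "template": "generic-host"},
    ]

    def host_definition(row):
        """Render one Nagios/ICINGA 'define host' block."""
        return (
            "define host {\n"
            f"    use        {row['template']}\n"
            f"    host_name  {row['name']}\n"
            f"    address    {row['address']}\n"
            "}\n"
        )

    with open("hosts.cfg", "w") as cfg:
        cfg.writelines(host_definition(r) for r in hosts)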
WEPMU037 Virtualization for the LHCb Experiment 1157
  • E. Bonaccorsi, L. Brarda, M. Chebbi, N. Neufeld
    CERN, Geneva, Switzerland
  • F. Sborzacchi
    INFN/LNF, Frascati (Roma), Italy
  The LHCb Experiment, one of the four large particle physics detectors at CERN, counts more than 2000 servers and embedded systems in its Online System. As a result of ever-increasing CPU performance in modern servers, many of the applications in the controls system are excellent candidates for virtualization technologies. We see virtualization as an approach to cut down cost, optimize resource usage and manage the complexity of the IT infrastructure of LHCb. Recently we have added a Kernel-based Virtual Machine (KVM) cluster based on Red Hat Enterprise Virtualization for Servers (RHEV), complementary to the existing Hyper-V cluster devoted only to the virtualization of the Windows guests. This paper describes the architecture of our solution based on KVM and RHEV, its integration with the existing Hyper-V infrastructure and the Quattor cluster management tools, and in particular how we run controls applications on a virtualized infrastructure. We present performance results of both the KVM and Hyper-V solutions, problems encountered, and a description of the management tools developed for the integration with the Online cluster and the LHCb SCADA control system based on PVSS.
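  As a small illustration of scripted management of KVM guests, the sketch below lists domains through the libvirt Python bindings. It shows only the KVM/libvirt side; the management tools described in the paper, covering RHEV, Hyper-V and Quattor integration, are assumed to be considerably more elaborate.

    # Hedged sketch: inspecting KVM guests through the libvirt Python
    # bindings (package libvirt-python; needs a running libvirt daemon).

    import libvirt

    conn = libvirt.open("qemu:///system")       # connect to the local KVM host
    try:
        for dom in conn.listAllDomains():       # every defined guest
            state, _reason = dom.state()
            running = state == libvirt.VIR_DOMAIN_RUNNING
            print(f"{dom.name():20s} {'running' if running else 'stopped'}")
    finally:
        conn.close()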
THCHAUST05 LHCb Online Log Analysis and Maintenance System 1228
  • J.C. Garnier, L. Brarda, N. Neufeld, F. Nikolaidis
    CERN, Geneva, Switzerland
  History has shown many times that computer logs are the only information an administrator may have about an incident, which could be caused either by a malfunction or by an attack. Due to the huge amount of logs produced by large-scale IT infrastructures such as LHCb Online, critical information may be overlooked or simply drowned in a sea of other messages. This clearly demonstrates the need for an automatic system for long-term maintenance and real-time analysis of the logs. We have constructed a low-cost, fault-tolerant centralized logging system which is able to do in-depth analysis and cross-correlation of every log. The system is capable of handling O(10000) different log sources and numerous formats, while keeping the overhead as low as possible. It provides log gathering and management, offline analysis and online analysis. We call offline analysis the procedure of analyzing old logs for critical information, while online analysis refers to early alerting and reacting. The system is extensible and cooperates well with other applications such as Intrusion Detection/Prevention Systems. This paper presents the LHCb Online topology, the problems we had to overcome and our solutions. Special emphasis is given to log analysis, how we use it for monitoring, and how we maintain uninterrupted access to the logs. We provide performance plots, code modifications to well-known log tools and our experience from trying various storage strategies.
Slides THCHAUST05 [0.377 MB]
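  The "online analysis" part can be pictured as pattern matching over a live log stream with immediate alerting. The Python sketch below shows this idea in miniature; the patterns, the alerting hook and the suggested syslog pipeline are invented for illustration and stand in for the much larger production system.

    # Illustrative sketch of online log analysis: watch a stream of log
    # lines and alert on critical patterns in near real time. The real
    # system handles O(10000) sources and many formats.

    import re
    import sys

    ALERTS = [                       # (compiled pattern, alert label)
        (re.compile(r"segfault|kernel panic", re.I), "host failure"),
        (re.compile(r"authentication failure", re.I), "possible intrusion"),
    ]

    def alert(label, line):
        print(f"ALERT [{label}]: {line.rstrip()}", file=sys.stderr)

    # e.g. pipe syslog into this script:
    #   tail -F /var/log/messages | python3 watch.py
    for line in sys.stdin:
        for pattern, label in ALERTS:
            if pattern.search(line):
                alert(label, line)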