ICALEPCS2011 - List of Authors (Haen, C.)

Paper

Title

Page

LHCb Online Infrastructure Monitoring Tools

618

L.G. Cardoso, C. Gaspar, C. Haen, N. Neufeld, F. Varela
CERN, Geneva, Switzerland
D. Galli
INFN-Bologna, Bologna, Italy

The Online System of the LHCb experiment at CERN is composed of a very large number of PCs: around 1500 in a CPU farm for performing the High Level Trigger; around 170 for the control system, running the SCADA system - PVSS; and several others for performing data monitoring, reconstruction, storage, and infrastructure tasks, like databases, etc. Some PCs run Linux, some run Windows but all of them need to be remotely controlled and monitored to make sure they are correctly running and to be able, for example, to reboot them whenever necessary. A set of tools was developed in order to centrally monitor the status of all PCs and PVSS Projects needed to run the experiment: a Farm Monitoring and Control (FMC) tool, which provides the lower level access to the PCs, and a System Overview Tool (developed within the Joint Controls Project – JCOP), which provides a centralized interface to the FMC tool and adds PVSS project monitoring and control. The implementation of these tools has provided a reliable and efficient way to manage the system, both during normal operations but also during shutdowns, upgrades or maintenance operations. This paper will present the particular implementation of this tool in the LHCb experiment and the benefits of its usage in a large scale heterogeneous system.

Slides WEBHAUST01 [3.211 MB]

WEPMU035

Distributed Monitoring System Based on ICINGA

1149

C. Haen, E. Bonaccorsi, N. Neufeld
CERN, Geneva, Switzerland

The basic services of the large IT infrastructure of the LHCb experiment are monitored with ICINGA, a fork of the industry standard monitoring software NAGIOS. The infrastructure includes thousands of servers and computers, storage devices, more than 200 network devices and many VLANS, databases, hundreds diskless nodes and many more. The amount of configuration files needed to control the whole installation is big, and there is a lot of duplication, when the monitoring infrastructure is distributed over several servers. In order to ease the manipulation of the configuration files, we designed a monitoring schema particularly adapted to our network and taking advantage of its specificities, and developed a tool to centralize its configuration in a database. Thanks to this tool, we could also parse all our previous configuration files, and thus fill in our Oracle database, that comes as a replacement of the previous Active Directory based solution. A web frontend allows non-expert users to easily add new entities to monitor. We present the schema of our monitoring infrastructure and the tool used to manage and automatically generate the configuration for ICINGA.

Poster WEPMU035 [0.375 MB]

Paper	Title	Page
WEBHAUST01	LHCb Online Infrastructure Monitoring Tools	618
	L.G. Cardoso, C. Gaspar, C. Haen, N. Neufeld, F. Varela CERN, Geneva, Switzerland D. Galli INFN-Bologna, Bologna, Italy
	The Online System of the LHCb experiment at CERN is composed of a very large number of PCs: around 1500 in a CPU farm for performing the High Level Trigger; around 170 for the control system, running the SCADA system - PVSS; and several others for performing data monitoring, reconstruction, storage, and infrastructure tasks, like databases, etc. Some PCs run Linux, some run Windows but all of them need to be remotely controlled and monitored to make sure they are correctly running and to be able, for example, to reboot them whenever necessary. A set of tools was developed in order to centrally monitor the status of all PCs and PVSS Projects needed to run the experiment: a Farm Monitoring and Control (FMC) tool, which provides the lower level access to the PCs, and a System Overview Tool (developed within the Joint Controls Project – JCOP), which provides a centralized interface to the FMC tool and adds PVSS project monitoring and control. The implementation of these tools has provided a reliable and efficient way to manage the system, both during normal operations but also during shutdowns, upgrades or maintenance operations. This paper will present the particular implementation of this tool in the LHCb experiment and the benefits of its usage in a large scale heterogeneous system.
	Slides WEBHAUST01 [3.211 MB]

WEPMU035	Distributed Monitoring System Based on ICINGA	1149
	C. Haen, E. Bonaccorsi, N. Neufeld CERN, Geneva, Switzerland
	The basic services of the large IT infrastructure of the LHCb experiment are monitored with ICINGA, a fork of the industry standard monitoring software NAGIOS. The infrastructure includes thousands of servers and computers, storage devices, more than 200 network devices and many VLANS, databases, hundreds diskless nodes and many more. The amount of configuration files needed to control the whole installation is big, and there is a lot of duplication, when the monitoring infrastructure is distributed over several servers. In order to ease the manipulation of the configuration files, we designed a monitoring schema particularly adapted to our network and taking advantage of its specificities, and developed a tool to centralize its configuration in a database. Thanks to this tool, we could also parse all our previous configuration files, and thus fill in our Oracle database, that comes as a replacement of the previous Active Directory based solution. A web frontend allows non-expert users to easily add new entities to monitor. We present the schema of our monitoring infrastructure and the tool used to manage and automatically generate the configuration for ICINGA.
	Poster WEPMU035 [0.375 MB]