ICALEPCS2011 - List of Authors (Garnier, J.C.)

Paper	Title	Page
MOBAUST06	The LHCb Experiment Control System: on the Path to Full Automation	20

C. Gaspar, F. Alessio, L.G. Cardoso, M. Frank, J.C. Garnier, R. Jacobsson, B. Jost, N. Neufeld, R. Schwemmer, E. van Herwijnen
CERN, Geneva, Switzerland
O. Callot
LAL, Orsay, France
B. Franek
STFC/RAL, Chilton, Didcot, Oxon, United Kingdom

LHCb is a large experiment at the LHC accelerator. The experiment control system is in charge of the configuration, control and monitoring of the different sub-detectors and of all areas of the online system: the Detector Control System (DCS), covering the sub-detectors' voltages, cooling, temperatures, etc.; the Data Acquisition System (DAQ) and the Run-Control; the High Level Trigger (HLT), a farm of around 1500 PCs running trigger algorithms; etc. The building blocks of the control system are based on the PVSS SCADA System, complemented by a control Framework developed in common for the 4 LHC experiments. This framework includes an "expert system"-like tool called SMI++, which we use for the system automation. The full control system runs distributed over around 160 PCs and is logically organised in a hierarchical structure, each level being capable of supervising and synchronizing the objects below. The experiment's operations are now almost completely automated, driven by a top-level object called Big-Brother, which pilots all the experiment's standard procedures and the most common error-recovery procedures. Some examples of automated procedures are: powering the detector, acting on the Run-Control (Start/Stop Run, etc.) and moving the vertex detector in/out of the beam, all driven by the state of the accelerator, or recovering from errors in the HLT farm. The architecture, tools and mechanisms used for the implementation, as well as some operational examples, will be shown.

Slides MOBAUST06 [1.451 MB]

THCHAUST05	LHCb Online Log Analysis and Maintenance System	1228

J.C. Garnier, L. Brarda, N. Neufeld, F. Nikolaidis
CERN, Geneva, Switzerland

History has shown that computer logs are often the only information an administrator has about an incident, which could be caused either by a malfunction or by an attack. Due to the huge amount of logs produced by large-scale IT infrastructures such as LHCb Online, critical information may be overlooked or simply drowned in a sea of other messages. This clearly demonstrates the need for an automatic system for long-term maintenance and real-time analysis of the logs. We have constructed a low-cost, fault-tolerant centralized logging system which is able to do in-depth analysis and cross-correlation of every log. This system is capable of handling O(10000) different log sources and numerous formats, while keeping the overhead as low as possible. It provides log gathering and management, offline analysis and online analysis. We call offline analysis the procedure of analyzing old logs for critical information, while online analysis refers to the procedure of early alerting and reacting. The system is extensible and cooperates well with other applications such as Intrusion Detection / Prevention Systems. This paper presents the LHCb Online topology, the problems we had to overcome and our solutions. Special emphasis is given to log analysis, how we use it for monitoring, and how we maintain uninterrupted access to the logs. We provide performance plots, code modifications to well-known log tools, and our experience from trying various storage strategies.

Slides THCHAUST05 [0.377 MB]
