Paper: TUDPP01
Title: A Monitoring System for the New ALICE O2 Farm
Page: 835
- G. Vino, D. Elia
INFN-Bari, Bari, Italy
- V. Chibante Barroso, A. Wegrzynek
CERN, Meyrin, Switzerland
The ALICE experiment has been designed to study the physics of strongly interacting matter with heavy-ion collisions at the CERN LHC. A major upgrade of the detector and computing model (O2, Offline-Online) is currently ongoing. The ALICE O2 farm will consist of almost 1000 nodes able to read out and process on the fly about 27 Tb/s of raw data. To increase the efficiency of computing farm operations, a general-purpose, near-real-time monitoring system has been developed; it relies on high performance, high availability, modularity, and open-source software. The core component (Apache Kafka) provides high-throughput data pipelines and fault-tolerant services. Additional monitoring functionality is based on Telegraf as the metric collector, Apache Spark for complex aggregation, InfluxDB as the time-series database, and Grafana as the visualization tool. A logging service based on the Elasticsearch stack is also included. The system handles metrics from the operating system, network, custom hardware, and in-house software. A prototype is currently running at CERN and has also been successfully deployed by the ReCaS Datacenter at INFN Bari for both monitoring and logging.
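The collect-then-aggregate flow described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: it assumes metrics arrive as simplified InfluxDB line-protocol records (a format Telegraf can emit; the abstract does not state which format the O2 system uses), and it replaces the Kafka and Spark stages with a plain in-memory tumbling-window average. The function names, the `host` tag, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def parse_line(line):
    """Parse a simplified line-protocol record.

    Assumption: no escaped spaces or commas and numeric fields only,
    which is narrower than the full line-protocol grammar.
    """
    head, fields_part, ts = line.strip().split(" ")
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in fields_part.split(","):
        key, value = item.split("=", 1)
        # A trailing "i" marks an integer field in line protocol.
        fields[key] = float(value.rstrip("i"))
    return measurement, tags, fields, int(ts)


def tumbling_mean(lines, field, window_ns):
    """Average `field` per (host, window start) over fixed, non-overlapping windows."""
    buckets = defaultdict(list)
    for line in lines:
        _, tags, fields, ts = parse_line(line)
        if field in fields:
            window_start = ts - ts % window_ns
            buckets[(tags.get("host", "unknown"), window_start)].append(fields[field])
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical sample records; timestamps are in nanoseconds.
SAMPLE = [
    "cpu,host=node01 usage_user=10.0 1000000000",
    "cpu,host=node01 usage_user=30.0 1500000000",
    "cpu,host=node02 usage_user=50.0 1200000000",
    "cpu,host=node01 usage_user=70.0 2100000000",
]

result = tumbling_mean(SAMPLE, "usage_user", window_ns=1_000_000_000)
for (host, window_start), value in sorted(result.items()):
    print(host, window_start, value)
```

In a deployment along the lines of the abstract, the aggregated values would be written to a time-series database and plotted in a dashboard; here they are only printed.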
Slides TUDPP01 [1.128 MB]
DOI: https://doi.org/10.18429/JACoW-ICALEPCS2019-TUDPP01
Paper received: 30 September 2019; paper accepted: 10 October 2019; issue date: 30 August 2020