Author: Vino, G.
Paper Title Page
TUDPP01 A Monitoring System for the New ALICE O2 Farm 835
 
  • G. Vino, D. Elia
    INFN-Bari, Bari, Italy
  • V. Chibante Barroso, A. Wegrzynek
    CERN, Meyrin, Switzerland
 
  The ALICE experiment has been designed to study the physics of strongly interacting matter with heavy-ion collisions at the CERN LHC. A major upgrade of the detector and computing model (O2, Offline-Online) is currently ongoing. The ALICE O2 farm will consist of almost 1000 nodes able to read out and process on the fly about 27 Tb/s of raw data. To increase the efficiency of computing farm operations, a general-purpose, near-real-time monitoring system has been developed. It relies on high performance, high availability, modularity, and open-source software. The core component, Apache Kafka, provides high-throughput data pipelines and fault-tolerant services. Additional monitoring functionality is based on Telegraf as the metric collector, Apache Spark for complex aggregation, InfluxDB as the time-series database, and Grafana as the visualization tool. A logging service based on the Elasticsearch stack is also included. The system handles metrics coming from the operating system, the network, custom hardware, and in-house software. A prototype version is currently running at CERN and has also been successfully deployed by the ReCaS Datacenter at INFN Bari for both monitoring and logging.  
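  The collect, aggregate, and store flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and uses none of the named tools: a plain function stands in for the windowed aggregation that the abstract assigns to Apache Spark, and another formats the result in InfluxDB line protocol, which is the text format InfluxDB ingests. The function names, the `cpu` measurement, the `host` tag, and the 60-second window are all hypothetical, and escaping of special characters in line protocol is deliberately left out.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_window(samples, window_s):
    """Average values per (host, window start).

    `samples` is an iterable of (host, unix_seconds, value) tuples.
    This is a stand-in for a streaming aggregation step, not Spark code.
    """
    buckets = defaultdict(list)
    for host, ts, value in samples:
        window_start = ts - ts % window_s  # align timestamp to window boundary
        buckets[(host, window_start)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line-protocol string.

    Assumes numeric fields (written as floats) and tag/field names and
    values that need no escaping (no spaces, commas, or equals signs).
    """
    head = measurement
    if tags:
        head += "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{head} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical CPU usage samples: (host, unix seconds, percent)
    samples = [
        ("node01", 0, 10.0),
        ("node01", 30, 20.0),
        ("node01", 60, 40.0),
        ("node02", 10, 5.0),
    ]
    for (host, start), avg in sorted(aggregate_by_window(samples, 60).items()):
        # Line protocol timestamps default to nanosecond precision
        print(to_line_protocol("cpu", {"host": host}, {"usage": avg}, start * 10**9))
```

  In a deployment along the lines the abstract describes, the samples would presumably arrive through Kafka topics fed by Telegraf, and the formatted points would be written to InfluxDB for display in Grafana; the sketch only shows the shape of the data at the aggregation and storage steps.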
Slides TUDPP01 [1.128 MB]  
DOI • reference for this paper ※ https://doi.org/10.18429/JACoW-ICALEPCS2019-TUDPP01  
About • paper received ※ 30 September 2019       paper accepted ※ 10 October 2019       issue date ※ 30 August 2020  