Author: Brarda, L.
Paper Title Page
WEMMU005 Fabric Management with Diskless Servers and Quattor on LHCb 691
 
  • P. Schweitzer, E. Bonaccorsi, L. Brarda, N. Neufeld
    CERN, Geneva, Switzerland
 
  Large scientific experiments nowadays very often use large computer farms to process the events acquired from the detectors. In LHCb a small sysadmin team manages the 1400 servers of the LHCb Event Filter Farm, as well as a wide variety of control servers for the detector electronics and infrastructure computers: file servers, gateways, DNS, DHCP and others. This variety of servers could not be handled without a solid fabric management system. We chose the Quattor toolkit for this task. We will present our use of this toolkit, with an emphasis on how we handle our diskless nodes (Event Filter Farm nodes and computers embedded in the acquisition electronic cards). We will show our current tests to replace the standard (RedHat/Scientific Linux) way of handling diskless nodes with fusion filesystems, and how this improves fabric management.
Slides WEMMU005 [0.119 MB]
Poster WEMMU005 [0.602 MB]
 
WEPMU037 Virtualization for the LHCb Experiment 1157
 
  • E. Bonaccorsi, L. Brarda, M. Chebbi, N. Neufeld
    CERN, Geneva, Switzerland
  • F. Sborzacchi
    INFN/LNF, Frascati (Roma), Italy
 
  The LHCb Experiment, one of the four large particle physics detectors at CERN, counts in its Online System more than 2000 servers and embedded systems. As a result of ever-increasing CPU performance in modern servers, many of the applications in the controls system are excellent candidates for virtualization technologies. We see virtualization as an approach to cut down cost, optimize resource usage and manage the complexity of the IT infrastructure of LHCb. Recently we have added a Kernel Virtual Machine (KVM) cluster based on Red Hat Enterprise Virtualization for Servers (RHEV), complementary to the existing Hyper-V cluster devoted only to the virtualization of the Windows guests. This paper describes the architecture of our solution based on KVM and RHEV, along with its integration with the existing Hyper-V infrastructure and the Quattor cluster management tools, and in particular how we use it to run controls applications on a virtualized infrastructure. We present performance results of both the KVM and Hyper-V solutions, problems encountered and a description of the management tools developed for the integration with the Online cluster and the LHCb SCADA control system based on PVSS.
 
THCHAUST05 LHCb Online Log Analysis and Maintenance System 1228
 
  • J.C. Garnier, L. Brarda, N. Neufeld, F. Nikolaidis
    CERN, Geneva, Switzerland
 
  History has shown that computer logs are often the only information an administrator has about an incident, which could be caused either by a malfunction or by an attack. Due to the huge amount of logs produced by large-scale IT infrastructures, such as LHCb Online, critical information may be overlooked or simply drowned in a sea of other messages. This clearly demonstrates the need for an automatic system for long-term maintenance and real-time analysis of the logs. We have constructed a low-cost, fault-tolerant centralized logging system which is able to do in-depth analysis and cross-correlation of every log. This system is capable of handling O(10000) different log sources and numerous formats, while trying to keep the overhead as low as possible. It provides log gathering and management, offline analysis and online analysis. We call offline analysis the procedure of analyzing old logs for critical information, while online analysis refers to the procedure of early alerting and reacting. The system is extensible and cooperates well with other applications such as Intrusion Detection / Prevention Systems. This paper presents the LHCb Online topology, the problems we had to overcome and our solutions. Special emphasis is given to log analysis, how we use it for monitoring and how we can have uninterrupted access to the logs. We provide performance plots, code modifications in well-known log tools and our experience from trying various storage strategies.
Slides THCHAUST05 [0.377 MB]