Log analysis and user traceability

Eygene Ryabinkin, rea@grid.kiae.ru, Russian Research Centre «Kurchatov Institute»

March 12th, 2009, OSCT-7 meeting, Madrid

lcg-CE logs: general ideas

CE logs link Grid jobs to the local jobs, so they are the most logical point to start from.

Jobmap logs are available and they have almost all information: user DN, VOMS FQAN, Grid (EDG) and LRMS job IDs, local user mapping and gatekeeper contact.
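A minimal Python sketch of turning one jobmap entry into a record. The line layout used here (double-quoted "key=value" fields) and the field names localUser, userDN, userFQAN, jobID, ceID, lrmsID and timestamp are assumptions made for illustration, as are all sample values; check them against a real jobmap file before relying on this.

```python
import re

# Assumption: each jobmap line is a sequence of double-quoted "key=value" fields.
FIELD_RE = re.compile(r'"([^"=]+)=([^"]*)"')

def parse_jobmap_line(line):
    """Return a dict of the key=value fields found in one jobmap line."""
    return {key: value for key, value in FIELD_RE.findall(line)}

# Hypothetical sample entry; every value below is made up.
sample = ('"localUser=18118" "userDN=/C=RU/O=Example/CN=Some User" '
          '"userFQAN=/examplevo/Role=NULL/Capability=NULL" '
          '"jobID=https://rb.example.org:9000/abc123" '
          '"ceID=ce.example.org:2119/jobmanager-lcgpbs-examplevo" '
          '"lrmsID=4242.ce.example.org" '
          '"timestamp=2009-03-12 10:15:00"')

record = parse_jobmap_line(sample)
print(record["userDN"], "->", record["lrmsID"])
```

The lrmsID value is the key that links the Grid job to the batch system records.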

With the LRMS ID we can trace the job down to the execution nodes. For Torque, one can use either accounting logs or plain job logs. I don't currently know how this works for SGE in YAIM's flavour.

lcg-CE logs: additional details

Jobmap logs are laid out by date: file names are grid-jobmap_YYYYMMDD. This is very good and handy.
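Because the date is in the file name, a time interval maps directly to a list of files. A short Python sketch, assuming the jobmap directory /opt/edg/var/gatekeeper given elsewhere in these slides; the function name jobmap_files is hypothetical.

```python
from datetime import date, timedelta

JOBMAP_DIR = "/opt/edg/var/gatekeeper"  # jobmap location quoted in these slides

def jobmap_files(first_day, last_day, directory=JOBMAP_DIR):
    """Return jobmap file paths for every day in [first_day, last_day]."""
    days = (last_day - first_day).days
    return [f"{directory}/grid-jobmap_{first_day + timedelta(days=n):%Y%m%d}"
            for n in range(days + 1)]

for path in jobmap_files(date(2009, 2, 27), date(2009, 3, 1)):
    print(path)
```

Only the files for the requested days need to be opened, so a narrow interval stays cheap even with years of logs on disk.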

Jobmap logs are missing the IP address of the client, so one should also parse gatekeeper logs – oops! GK logs are huge and ugly, and the only unique identifier that links a jobmap entry to GK entries is the Grid (EDG) job ID. IP address lookup involves looking up the GK JM ID and then searching the preceding entries for the IP.

lcg-CE logs: file locations

• /var/log/globus-gatekeeper.*: most verbose logs about jobs that gatekeeper processes.

• /opt/edg/var/gatekeeper/grid-jobmap_*: summaries of jobs run by lcgpbs and friends.

• /var/spool/pbs/server_priv/accounting/*: Torque logs that carry most activity traces; we are mainly interested in start/end events.

• /var/spool/pbs/server_logs/*: more verbose Torque logs, which exist only on the Torque server, not necessarily on the CE.
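As an illustration of pulling start/end events out of the accounting files, here is a rough Python sketch. It assumes the usual Torque accounting layout of "date time;record type;job id;space-separated key=value pairs", with S marking a job start and E a job end; splitting the attributes on whitespace is a simplification, and the sample lines and host names are made up.

```python
def parse_accounting_line(line):
    """Split one accounting record into (timestamp, type, job id, attributes)."""
    stamp, rec_type, job_id, message = line.rstrip("\n").split(";", 3)
    # Simplification: attribute values are assumed not to contain spaces.
    attrs = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return stamp, rec_type, job_id, attrs

def start_end_events(lines, lrms_id):
    """Yield (timestamp, type, attributes) for S and E records of one job."""
    for line in lines:
        if line.count(";") < 3:
            continue  # not an accounting record
        stamp, rec_type, job_id, attrs = parse_accounting_line(line)
        if job_id == lrms_id and rec_type in ("S", "E"):
            yield stamp, rec_type, attrs

# Hypothetical sample records.
sample = [
    "03/12/2009 10:14:58;Q;4242.ce.example.org;queue=examplevo",
    "03/12/2009 10:15:00;S;4242.ce.example.org;user=ex001 queue=examplevo "
    "start=1236852900 exec_host=wn01.example.org/0",
    "03/12/2009 10:16:00;E;4242.ce.example.org;user=ex001 queue=examplevo "
    "exec_host=wn01.example.org/0 Exit_status=0 resources_used.walltime=00:01:00",
]
events = list(start_end_events(sample, "4242.ce.example.org"))
for stamp, rec_type, attrs in events:
    print(stamp, rec_type, attrs.get("exec_host"))
```

The exec_host attribute is what leads from the LRMS job ID down to the execution node.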

lcg-CE logs: parsing modes – 1

Typical problem 1: some user with a known DN executed some jobs on the local farm in a given interval of time. Find these jobs and, possibly, dig out their details.

Solution: use the 'job-search' parsing mode, providing the user's DN (a regex, really) and the time interval. This gives the list of jobs for this user. The modifier '--dig-lrms' instructs the tool to look up job statistics in the LRMS records (currently Torque-only, using accounting logs).
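The selection logic behind such a search can be sketched in a few lines of Python. This is not the tool's code: the function name find_jobs, the record layout (dicts with userDN, lrmsID and a datetime timestamp) and the sample data are all assumptions for illustration.

```python
import re
from datetime import datetime

def find_jobs(records, dn_pattern, start, end):
    """Return records whose DN matches dn_pattern and whose time is in [start, end]."""
    dn_re = re.compile(dn_pattern)
    return [rec for rec in records
            if dn_re.search(rec["userDN"]) and start <= rec["timestamp"] <= end]

# Hypothetical, already-parsed jobmap records.
records = [
    {"userDN": "/C=RU/O=Example/CN=Alice Example", "lrmsID": "4242.ce.example.org",
     "timestamp": datetime(2009, 3, 12, 10, 15)},
    {"userDN": "/C=RU/O=Example/CN=Bob Example", "lrmsID": "4243.ce.example.org",
     "timestamp": datetime(2009, 3, 12, 11, 0)},
    {"userDN": "/C=RU/O=Example/CN=Alice Example", "lrmsID": "4100.ce.example.org",
     "timestamp": datetime(2009, 3, 1, 9, 0)},
]

hits = find_jobs(records, r"CN=Alice", datetime(2009, 3, 10), datetime(2009, 3, 13))
print([rec["lrmsID"] for rec in hits])
```

Treating the DN as a regular expression lets one query cover, for example, all certificates issued under one organisation.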

lcg-CE logs: parsing modes – 2

Typical problem 2: we want to trace the job by its Grid ID.

Solution: use the 'job-search' parsing mode, providing the Grid job ID. Jobmap logs are parsed in reverse chronological order and the search terminates on the first hit (Grid job IDs are unique), so recent jobs will be found rather quickly. '--dig-lrms' can be used to get LRMS job particulars.
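The newest-first search with early exit could look roughly like this in Python. It is only a sketch: the logs are represented as an in-memory mapping from a YYYYMMDD day string to that day's lines, the function name find_by_grid_id is hypothetical, and the sample entries are made up.

```python
def find_by_grid_id(lines_by_day, grid_id):
    """Search newest day first; return (day, line) for the first hit, else None."""
    # YYYYMMDD strings sort chronologically, so a reverse sort gives newest first.
    for day in sorted(lines_by_day, reverse=True):
        for line in lines_by_day[day]:
            # Plain substring test; a real tool would compare the parsed ID field.
            if grid_id in line:
                return day, line  # IDs are unique, so stop at the first match
    return None

# Hypothetical sample data keyed by the date part of the jobmap file name.
logs = {
    "20090310": ['"jobID=https://rb.example.org:9000/old-job" "lrmsID=4100.ce.example.org"'],
    "20090311": ['"jobID=https://rb.example.org:9000/other-job" "lrmsID=4200.ce.example.org"'],
    "20090312": ['"jobID=https://rb.example.org:9000/wanted-job" "lrmsID=4242.ce.example.org"'],
}

print(find_by_grid_id(logs, "https://rb.example.org:9000/wanted-job"))
```

With real files, the older logs would never be opened once the newest file yields the match.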

lcg-CE logs: parsing modes – 3

Typical problem 3: find jobs that were submitted using pure Globus (not LCG/gLite) methods in the given time frame. The rationale is to see who is submitting jobs directly to our CE.

Solution: use the 'job-search' parsing mode, providing the time range and specifying the '--only-direct' switch. This mode will catch only LRMS jobs: uses of the fork jobmanager won't be caught.

lcg-CE logs: parsing modes – 4

Typical problem 4: find all jobs that were using 'fork' jobmanager (direct execution on CE host).

This parsing mode is not finished, but 'job-search' with the modifier '--only-fork' and a time range will do the work. One problem is that here we need to parse the full gatekeeper logs and extract the records that don't correspond to regular non-fork jobs. Since normal jobs also use the fork jobmanager to spawn grid-monitor/Condor-C, the problem is not quite trivial.

lcg-CE logs: GridFTP

We also have GridFTP logs on the CE. Do we really need to parse them too?

The best request we can easily process is the following one: please find all GridFTP activity for the given user in the given time frame.

We can try to relate various GridFTP sessions and even tie them to the jobs, but this will involve heuristics, and the checks won't be easy.

So, the question is: do we need this?

lcg-CE logs: current status

• Have a toolset to trace jobs by their Grid (EDG) ID, user DN and to find pure Globus jobs.

• The toolset is currently being refactored to provide a framework for doing log lookup on other node types and to separate the file parsers from the analysis core.

• The current language is Perl, but I am thinking about a Python variant; it could be faster and cleaner.

• Will show the tools to the public after some refactoring and polishing.

lcg-CE logs: roadmap

• Finish 'fork' job detection.

• SGE support: Sun GridEngine is currently supported by gLite too, although its user base isn't very large now.

• Add more bells and whistles to the current tools: limit the number of job records, provide a command to find the most active users, etc.

• Probably implement parsing of GridFTP logs.

• Anything else I have missed.

RB/LB logs: ideas and questions

• No real code written, only research/planning.

• RBs are now slightly out of fashion, people prefer WMS, but we still have some working RBs.

• LB has a database where the bookkeeping information is stored, and we can use good old SQL to interrogate it. But Daniel said that

– we shouldn't use pure SQL, because of possible schema changes;

– it doesn't have all the useful information.

LB logs: ideas and questions

• Daniel also said that there should be a better way to interrogate the LB database, but up to now I have always used plain SQL to do it.

• The gathered data will be the same as that provided by 'edg-job-logging-info'. One distinction is that the use of 'job-logging-info' is subject to ACLs, while direct use of the SQL DB is not.

• In the case of a combined LB/RB (or LB/WMS), we can also extract some information from the SandBox directory.

RB logs: GridFTP

GridFTP logs on RBs are minimal: no session traces, just accounting data in /var/log/edg-wl-in.ftpd.log. No user DNs, only pool account user names. Some path names carry job IDs, so we can identify user sessions and relate them to the jobs; this could be handy.

In principle, it is sometimes interesting to know who fetched a user's output sandbox, so we should probably try to parse these logs.

WMS/Cream CE

No real work has been done to date, only planning.

I have a WMS instance, so I plan to research what data can be collected from this node type. I expect that job traces similar to the RB ones and download/upload records (both GridFTP and HTTP) will be available.

A Cream CE instance is going to be deployed in a couple of months. Once it is up, I'll analyze it too.

Data management: SE logs

• Only planned; no real work has been done.

• Can only speak about DPM SE for now: I have no dCache instance.

• As I recall from SSC2, DPNS and DPM logs have some shared identifiers that can be used to relate records across the various log files.

• Needs more analysis: I haven't concentrated on the DM logs yet.

Thanks!

Thanks to Daniel Kouril for presenting this material and for discussing and advising on the presentation.

Thanks to everyone who listened to this session.

Questions? Suggestions? Feel free to ask ;))

