Harvard FAS Academic Cluster

Cole Killian

Created: 2020-08-27 Thu 15:05

The Academic Cluster

The Problem We Want To Address

Classes are requiring more and more access to compute resources.

Previous Solutions

  • Giving students access to the main cluster
    • High friction
    • Student requirements differ from those of typical research users
  • Deploying infrastructure with AWS
    • Expensive

The Goals of the Academic Cluster

  • Reduce friction between users and compute resources.
  • Accommodate student requirements
  • Reduce costs

Navigate to Course in Canvas

canvas_fas_example.png

Waiting Page

new_waiting_page_existing_user.png

Dashboard

job_creation.png

Architecture

Improvements to the Academic Cluster

  • User Experience
  • Backend Stability and Scaling
  • Frontend Stability and Scaling

User Experience

Not one big project. Lots of small things, each with the goal of improving the user experience.

The Old Waiting Page

old_waiting_page.png

The New Waiting Page

First time user new_waiting_page_first_time_user.png

Existing user new_waiting_page_existing_user.png

The Old 403 Page

old_403_page.png

The New 403 Page

new_403_page.png

Quota Display

quota_demo.png

Straight To Jobs

Send users straight to job creation.

job_creation.png

Backend Stability and Scaling

We know that the backend user account creation service works, but not how well it scales.

Understanding Account Creation

Account_Creation_Flow_Before.png

Testing Tools

  • Rolled our own backend testing tools with bash
  • Testing tools enabled us to find account creation bottlenecks.
  • Find documentation on running a test in the account-creation repository.

Account Creation After

Account_Creation_Flow_After.png

Improvements from Testing

  • Processing user account requests in parallel instead of in sequence
  • Offloading ssh key generation to a separate queue of tasks (a rough sketch of the pattern follows this list)
  • Brought speed from 0.25 accounts per second to 15+ accounts per second.
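
A rough sketch of the pattern, for illustration only (this is not the actual account-creation service; requests.txt, the key paths, and the create_account stub are hypothetical): account requests are fanned out to a pool of parallel workers, while ssh key generation is pushed onto a simple directory-based queue that a separate background worker drains on its own schedule.

#!/usr/bin/env bash
# Illustrative sketch of the parallelization pattern only; the real service differs.
# requests.txt, the key paths, and the create_account stub are hypothetical.
REQUEST_THREADS=${REQUEST_THREADS:-8}
KEY_QUEUE=$(mktemp -d)        # directory used as a simple work queue
KEY_DIR=$(mktemp -d)          # where generated keys land in this sketch

create_account() {
    local user=$1
    # ... ldap add, home directory creation, etc. would happen here ...
    touch "$KEY_QUEUE/$user"  # enqueue key generation instead of blocking on it
}
export -f create_account
export KEY_QUEUE

# Separate worker that drains the ssh-keygen queue independently of account creation
key_worker() {
    while true; do
        for req in "$KEY_QUEUE"/*; do
            [ -e "$req" ] || continue
            ssh-keygen -q -t rsa -N '' -f "$KEY_DIR/$(basename "$req")" && rm -f "$req"
        done
        sleep 1
    done
}
key_worker &

# Handle account requests in parallel, REQUEST_THREADS at a time
xargs -P "$REQUEST_THREADS" -I{} bash -c 'create_account "$1"' _ {} < requests.txt
# (The key worker keeps running here; in a real service it would be its own daemon.)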

Current State of the Backend

Testing with 100 requests sent over the course of 3 seconds. Average results over 10 trials:

Metric                                       Value
User accounts created per second             17.67
Total time to create 100 user accounts (s)   7.80
Max % CPU time spent in user space           82
Max % CPU time spent in kernel space         46.94
ssh keys generated per second                2.07
Total time to create 100 ssh keys (s)        48.66

Monitoring

  • Cleaned up code and added documentation
  • Ready to plug in to monitoring

Frontend Stability and Scaling

As with the backend, we know that the frontend service works, but not how many users it can support.

Testing Tools

  • Selenium
    • Designed for functional testing
    • Too computationally expensive
  • JMeter
    • Designed for performance testing
    • Not flexible; GUI-driven
  • Locust
    • Designed for performance testing
    • Flexible; tests are written in Python

In designing the benchmarks, we were interested in:

  • How long users spend on the waiting page
  • How long it takes users to make it to the dashboard (a rough measurement sketch follows this list)
  • How many users each host can support at once
  • How Slurm responds when users concurrently start jobs
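
The actual numbers come from the Locust tests described later, but as a rough illustration of the first two metrics, a loop like the one below (BASE_URL, the cookie file, and the "waiting" marker string are all assumptions) times how long a session sits on the waiting page before the dashboard responds.

#!/usr/bin/env bash
# Rough illustration only; the real measurements come from the Locust tests.
# BASE_URL, cookies.txt, and the "waiting" marker text are assumptions.
BASE_URL=${BASE_URL:?set BASE_URL to the OnDemand instance}
start=$(date +%s.%N)

# Poll the dashboard until the response no longer looks like the waiting page
while curl -s -b cookies.txt "$BASE_URL/pun/sys/dashboard" | grep -qi "waiting"; do
    sleep 1
done

end=$(date +%s.%N)
echo "time to dashboard: $(echo "$end - $start" | bc) s"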

Testing Flows

Testing_Flow.png

Findings from Testing

  • Discovered an incompatibility between our cookie decryption and the Apache mpm_event module.
  • Exposed an unexpectedly high load average.
  • Developed confidence in the number of users that the system can support.

The Story of Cookie Decryption

  • Originally using mpm_prefork
  • OSC recommended switching to mpm_event, so we gave it a try
  • Discovered that cookie decryption wasn't thread safe
  • Switched to a thread safe decryption module
  • Discovered that cookie encryption wasn't thread safe
  • Reverted to mpm_prefork

Load Average Graph

loadavg_graph.png

Have you seen this before?

Timestamp PID %us %sy %CPU Command
11:55:14 64735 0.4 43.8 44.2 ps
11:55:14 64736 0.8 47 47.8 ps
11:55:14 64737 0.4 39.6 40 ps
11:55:14 64738 0.2 42.6 42.8 ps
11:55:14 64739 0.8 55.8 56.6 ps
11:55:14 64743 1 54.4 55.4 ps
11:55:14 64745 0.8 47.2 48 ps
11:55:14 64746 0.6 47.8 48.4 ps
11:55:14 64748 0.6 36.2 36.8 ps
11:55:14 64749 0.8 56.4 57.2 ps
11:55:14 64750 0.4 43.4 43.8 ps
11:55:14 64751 0.8 59 59.8 ps
11:55:14 64752 0.6 50.4 51 ps
11:55:14 64753 1 56.4 57.4 ps
11:55:14 64754 0.4 46 46.4 ps
11:55:14 64755 1 50.2 51.2 ps
11:55:14 64756 0.6 58.2 58.8 ps
11:55:14 64759 0.8 46 46.8 ps
11:55:14 64760 0.4 38.4 38.8 ps
11:55:14 64761 0.4 42.2 42.6 ps
11:55:14 64762 0.4 39 39.4 ps
11:55:14 64763 0.4 37 37.4 ps
11:55:14 64764 0.8 55.4 56.2 ps
11:55:14 64766 0.6 39.2 39.8 ps
11:55:14 64767 0.4 44 44.4 ps

Next Steps

#!/usr/bin/env bash
# Kill any running ps processes every half second, printing a count of the matches
while true; do
    pkill -c ps
    sleep 0.5
done

  • I posted on the OOD Discourse
  • Reduce the frequency of ps calls by changing the source code and recompiling Passenger
  • Test on a VM to see whether the problem is hardware-specific

Current Performance: New Users

Test Specs   Avg Waiting Page Time (s)   Avg Time to Dashboard (s)
10-2         20.77                       28.77
100-25       24.94                       40.97
200-50       31.51                       66.13

Current Performance: Existing Users

Test Specs   Avg Waiting Page Time (s)   Avg Time to Dashboard (s)
10-2         0.25                        7.14
100-25       0.27                        8.81
200-50       0.36                        16.19

Monitoring

  • Cleaned up code and added documentation
  • Ready to plug in to monitoring

Questions?

Appendix

Technical Overview

academic-cluster-technical-overview.png

OnDemand Architecture

ood_overview.png

Jupyter Notebook Timing

Test Specs   Avg Time to Jupyter Notebook (s)
30-5         58.43
15-5         48.30
5-2          35.00

Converting xfs_quota output to JSON

xfs_quota output

[root@academic-perftestsfs /]# xfs_quota -x -c 'report -u -bir' $xfs_path
User quota on /srv/export/g_34166 (/dev/vdb)
                               Blocks                                          Inodes                          
User ID          Used       Soft       Hard    Warn/Grace           Used       Soft       Hard    Warn/ Grace   
---------- -------------------------------------------------- --------------------------------------------------
root                8          0          0     00 [--------]          6          0          0     00 [--------]
u_316301_g_34166        932          0          0     00 [--------]        274          0          0     00 [--------]
u_270426_g_73222          0   19922944   20971520     00 [--------]          0      95000     100000     00 [--------]

JSON format

{
  "version": 1,
  "timestamp": 1596732082.017182,
  "quotas": [
    {
      "block_limit": 1000,
      "file_limit": 400,
      "path": "/n/academic_homes/g_34166/u_316301_g_34166",
      "total_block_usage": 932,
      "total_file_usage": 274,
      "user": "u_316301_g_34166"
    },
    {
      "block_limit": 19922944,
      "file_limit": 95000,
      "path": "/n/academic_homes/g_73222/u_270426_g_73222",
      "total_block_usage": 0,
      "total_file_usage": 0,
      "user": "u_270426_g_73222"
    }
  ]
}
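
A minimal sketch of one way to do the conversion with awk, assuming the soft block and inode limits map to block_limit and file_limit and that the group directory can be recovered from the u_<uid>_g_<gid> user name. The real converter also appears to substitute default limits when no quota is set (as in the first entry above); this sketch does not handle that case.

#!/usr/bin/env bash
# Sketch only: parse `xfs_quota -x -c 'report -u -bir'` output into the JSON
# layout shown above. Soft limits become block_limit/file_limit; default limits
# for users with no quota set are not handled.
xfs_path=${1:?usage: $0 <xfs_path>}

xfs_quota -x -c 'report -u -bir' "$xfs_path" | awk -v ts="$(date +%s.%N)" '
  /^u_/ {                                   # rows for academic users only
    user = $1
    split(user, parts, "_g_"); group = "g_" parts[2]
    fmt = "    {\"block_limit\": %s, \"file_limit\": %s, \"path\": \"/n/academic_homes/%s/%s\", \"total_block_usage\": %s, \"total_file_usage\": %s, \"user\": \"%s\"}"
    quotas[n++] = sprintf(fmt, $3, $8, group, user, $2, $7, user)
  }
  END {
    printf "{\n  \"version\": 1,\n  \"timestamp\": %s,\n  \"quotas\": [\n", ts
    for (i = 0; i < n; i++)
      printf "%s%s\n", quotas[i], (i < n - 1 ? "," : "")
    print "  ]"
    print "}"
  }'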

How to run a backend test

Find the docs in the account-creation repository. run-backend-test works by:

  • Clearing sss_cache
  • Starting the test monitoring tools
  • Starting the account-request and account-creation services
  • Sending the account requests
  • Waiting until all the ldifs are moved out of the trigger directory and all ssh keys are generated
  • Deleting all the generated users from LDAP and removing their home directories

It accepts five parameters; an example invocation follows the list.

  • --num_requests: The number of requests to simulate.
  • --request_threads: The number of account creation threads to run.
  • --key_threads: The number of ssh-keygen threads to run.
  • --fulltest: A flag indicating that a series of tests should be run instead of a single test.
  • --integrated: A flag indicating that an integrated test should be run instead of an LDAP test.
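
For example, a hypothetical invocation (check the script name, defaults, and exact flag spellings against the docs in the repository) might look like this:

# Hypothetical example; the request and thread counts are placeholders.
./run-backend-test --num_requests 100 --request_threads 4 --key_threads 2

# Run the full series of tests against the integrated setup instead.
./run-backend-test --fulltest --integrated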

How to run a locust test

Find the docs here

locust 2>&1 --csv ${OUT_DIR}/ --csv-full-history \
    | tee ${OUT_DIR}/stdout.log &

  • The --csv flags export the test data
  • stdout.log is where custom metrics like waiting page time and dashboard time are stored (a fuller example invocation follows below)
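
For reference, a fuller invocation might look like the sketch below. The locustfile name, host, user count, and spawn rate are placeholders, and the flag names follow Locust 1.x (older 0.x releases used --no-web and -c in place of --headless and -u).

# Placeholder values throughout; adjust for the locustfile and Locust version in use.
OUT_DIR=results/$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT_DIR"

locust -f locustfile.py --host "https://ondemand.example.edu" \
    --headless -u 100 -r 25 \
    --csv ${OUT_DIR}/ --csv-full-history 2>&1 \
    | tee ${OUT_DIR}/stdout.log &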

Smaller Stuff not worth mentioning

  • Added an explicit list of allowed groups
  • Made the code compatible with upstream OOD
  • Moved hardcoded values into configurable environment variables
  • Implemented a uid and gid check in account-request
  • Prevented account-creation from writing to local directories; this currently happens at the account-creation level, but could potentially move to the account-request level
  • Developed RPMs for easier deployment