SRA International, Inc., a CSRA Company: Senior HPC Systems Engineer in Boulder, Colorado
Clearance Level Must Currently Possess:
No Active Clearance Required
Clearance Level Must Be Able to Obtain:
No Active Clearance Required
Engineering & Sciences
CSRA has an opportunity available for a talented and innovative senior-level High Performance Computing (HPC) Linux Systems Administrator within our High Performance Computing Center of Excellence to provide continuing support for our NOAA Research and Development High Performance Computing Systems customer at NOAA's Earth System Research Laboratory in Boulder, Colorado.
The qualified candidate will bring hands-on technical and project management leadership skills to ensure HPC environment stability, plan for growth, and manage and support new technology insertions, and will provide remote technical support and consultation to our other supported NOAA sites in Fairmont, West Virginia, and Princeton, New Jersey.
Responsibilities and Duties:
Experience in daily management and oversight of medium to large HPC cluster environments is essential in order to maintain overall situational awareness of the HPC environment and to identify support or operational issues before they impact operations;
Independent problem solving and troubleshooting skills will be leveraged to quickly advance towards viable resolutions;
Provide leadership and team coordination for successful planning and execution of scheduled maintenance periods;
Hands-on experience with Lustre, NFS, and other NAS and parallel file systems will be heavily leveraged to ensure optimal performance tuning and planning for growth/expansion;
Experience with provisioning tools, such as xCAT, will be used to manage the building and rebuilding of HPC cluster front ends and compute and administration nodes in both diskful and diskless environments; Puppet will also be used to deliver consistent configuration management within this HPC environment;
Demonstrated ability to install open source software packages;
Experienced with the installation of commercial software and license managers for commercial software;
Demonstrated ability to install and tune compiler and utility software;
Programming experience and familiarity with scripting languages such as bash, sh, Perl, and Python will be used to manage, extend, and develop customized scripts to support the HPC user and system administration environments;
Formal change and configuration management practices are ingrained into the daily operations to ensure changes and configuration control implementations are properly documented and approved;
Excellent communication skills, both verbal and written, are required to ensure the customer, support team, HPC users, and other stakeholders are properly engaged and informed throughout any support or project management effort;
Experienced leadership in project management will be leveraged to ensure projects are properly planned and developed, stakeholders and project teams are coordinated, project execution and monitoring are properly performed, and projects are brought to closure as scheduled;
Knowledge of network technologies, such as InfiniBand and GigE, and their associated tools will be leveraged to troubleshoot and tune existing network fabrics and plan for future expansion;
Documentation skills will be applied to develop, improve, and enhance user and system administration online documentation repositories; extensive experience with Microsoft Suite (Project, Visio, Word, and Excel) will be leveraged considerably to support the various documentation efforts;
Coordination of troubleshooting and repairs with OEM vendors will be essential to identify root cause of issues, their timely resolution, and restoration of SLA-related services;
System administration responsibilities will include timely coordination of security patching for any identified IT security-related vulnerabilities;
Ability to work and contribute as a member of a small support team is an essential component within this effort;
Extended knowledge of batch scheduling and queuing systems (such as Moab/Torque or Slurm) is a plus.
10 or more years of experience in Systems Administration.
Bachelor's degree, or equivalent, in computer-related field, CS preferred.
Hands-on experience with Linux, Red Hat and CentOS in particular.
Hands-on experience with computer hardware maintenance, such as replacing DIMMs, disk drives, and PCIe cards.
Experience with Lustre, NFS, and other NAS and parallel file systems.
Understanding of basic networking components and tools, with a solid understanding of routing concepts.
Experience installing and removing software, both from prebuilt packages and compiling from source.
Experience in Linux/Unix programming or scripting (including Perl and Bash), and interest in task automation.
Ability to work in both local and remote technical support environments.
Strong creative problem solving skills to tackle highly complex large-scale technical problems.
Disciplined troubleshooting skills.
Experience in project and technical management.
Attention to detail, along with skills in areas such as time management, organization, analytical thinking, observation, and active listening.
Exceptional verbal and written communication skills.
Nice to have:
Experience in developing and maintaining software stacks.
Experience with InfiniBand.
Experience in writing C programs.
Working knowledge of batch scheduling and queuing systems (such as Moab/Torque or Slurm).
Telecommuting Options:
Some Telecommuting Allowed
USA CO Boulder - 325 Broadway (COC029)
CSRA is committed to creating a diverse environment and is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.
THINK NEXT. NOW.
CSRA is tomorrow’s thinking, today. To “Think Next. Now.” is to imagine a better future and to deliver it, today. For our customers, our partners, and ultimately, all the people our mission touches, CSRA is realizing the promise of technology to change the world through next-generation thinking and meaningful results.
We understand that our customers' missions require new methods and imaginative thinking. We bring together government IT professionals, emerging technologies, and the brightest, cutting-edge advisors in the industry to deliver a broad range of innovative, next-generation IT solutions and professional services to help our customers modernize their legacy systems, protect their networks and assets, and improve the effectiveness and efficiency of mission-critical functions for our warfighters and our citizens.
Everywhere you look, CSRA is there. We’re in our nation’s infrastructure, in training and education, in cyber security, in serving veterans who served us, and so much more. Take some time to learn more about CSRA. You might be surprised to learn how we touch your life.
We are a company of 18,000+ smart, talented individuals, yet we enjoy a start-up culture that inspires us to make a difference while delivering results in this rapidly evolving world. Join our team and use your skills and expertise to support the safety, security, health and well-being of the nation.