SSCC News June 2024

Welcome, New Members from the School of Education!

The School of Education (SoE) has joined the SSCC for a one-year pilot membership, after which we’ll jointly evaluate a permanent membership. We’ve scheduled several orientation sessions in the coming month, mostly (but not exclusively) for new members from SoE. Visit our training page for details and to register. More sessions will be offered in the fall. Send any questions you have to the SSCC Help Desk. We’re eager to help you get started using SSCC resources.

Changes to Storage Limits at the SSCC

Data storage has become a major expense for the SSCC. On average, it costs the SSCC $82 to store and back up 1TB of data for one year. That is less than the $120/TB/year charge for ResearchDrive space beyond the 25TB given to every PI for free, but in FY 2024 alone we’ve spent roughly $130,000 on storage. We’re asking all of our members who use significant amounts of storage to make sure they’re using it as efficiently as possible. To encourage efficient use, we have implemented a new set of quotas for project directories and new tools to manage them.

New project directories will start with a quota of 1TB while existing project directories have been given a quota equal to at least their current usage plus 20%. If you need more space you can request a quota of 5TB by visiting the SSCC self-service portal. You must be on campus or connected to a campus VPN (UW-Madison VPN, SSCC VPN, etc.) to access the portal. If you need more space beyond that, you can request a quota of 25TB using the same form. Your sponsoring agency will be informed of this request. If you still need more space, email the Help Desk and we’ll schedule a meeting to discuss your needs.

Storing data as very large numbers of small files uses space inefficiently and slows down the file server. Putting many files in a single directory is especially problematic: we suggest staying under about 1,000 files per directory. To maintain the performance of the file system, we have implemented a quota for project and home directories of either 1 million files or 20% more than the current file count, whichever is greater. You can request an increase to 5 million files for project directories using the portal. We will later reduce the maximum quota to 5 million files for all projects. This may require some changes to how you store your data; the SSCC’s Help Desk staff and statistical consultants will be happy to help.
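To see where a directory stands against these file-count limits, a short script can count files per directory. The following is a minimal sketch, not an SSCC tool: the function names are hypothetical, the 1,000-file guideline and the quota rule come from the paragraph above, and rounding the 20% margin up is an assumption. It counts regular directory entries only, which may differ from how the file server counts files against a quota.

```python
import os

FILES_PER_DIR_GUIDELINE = 1_000   # suggested ceiling per directory (from the text above)
BASE_FILE_QUOTA = 1_000_000       # base file-count quota (from the text above)


def count_files(root):
    """Return (total_files, {directory: file_count}) for the tree under root."""
    per_dir = {}
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        per_dir[dirpath] = len(filenames)
        total += len(filenames)
    return total, per_dir


def crowded_dirs(per_dir, limit=FILES_PER_DIR_GUIDELINE):
    """List (directory, count) pairs over the guideline, largest first."""
    over = [(d, n) for d, n in per_dir.items() if n > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)


def initial_file_quota(current_count):
    """1 million files or the current count plus 20%, whichever is greater.

    Integer arithmetic rounds the 20% margin up (an assumption).
    """
    return max(BASE_FILE_QUOTA, (current_count * 6 + 4) // 5)
```

To use it, call `count_files` on the top of a directory tree you own, pass the resulting dictionary to `crowded_dirs`, and consider consolidating or archiving the directories it lists.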

We appreciate your help in making sure our shared computing resources are used as efficiently as possible.

Default Linux Shell to Change to Bash

On July 19th, we will change the default Linux shell for most SSCC members from tcsh to bash. We will not change the shell of anyone who has customized tcsh. You can change your own shell by visiting the new self-service portal.

The “shell” is what you see when you first log into a Linux server. Many, many years ago the SSCC chose tcsh as our default shell because it made it easier to run long jobs (the kind of jobs that go to Slurm now). However, in the ensuing years, most of the world settled on bash because it’s better for writing shell scripts. If you do a web search for how to do something on a Linux server, what you find will probably assume you’re using bash. Most Linstat and LinSilo users don’t need to worry about this change and won’t notice any significant differences. But if you find you prefer tcsh, you can now change your shell at any time using the self-service portal.

SSCC Self-Service Portal

The SSCC self-service portal is now available and can be used for things like setting your preferred email address, updating your affiliations, requesting more storage space, checking your current storage usage, and changing your Linux shell. You must be on campus or connected to a campus VPN (UW-Madison VPN, SSCC VPN, etc.) to access the portal.

Blimp Now Available On SSCC Servers

Blimp is a standalone program that uses Bayesian estimation to handle missing data. Blimp is both easier and more flexible than multiple imputation. We have documentation to guide you on how to use Blimp on our servers.

Someone Submitted 1,000 Jobs To Slurm! Now What?

A large research computing cluster like the SSCC’s Slurm cluster is designed to get as much work done as possible, not necessarily to get a particular job done as soon as possible. In that sense it’s a good thing when new jobs can’t start right away: it means servers aren’t sitting idle. Jobs having to wait a few hours before they start is not considered a problem on a big cluster, but even that’s very rare at the SSCC. The Slurm Status page will tell you how long jobs have had to wait over the past three days.

Slurm prioritizes jobs based on how long they’ve been waiting and how much compute time the user has used recently. So if someone submits 1,000 jobs to the cluster, their jobs will quickly drop to the lowest priority. That doesn’t mean their jobs will be preempted by other jobs (unless those jobs are submitted to a higher priority partition) but it does mean that when a job finishes and computing resources become available, Slurm will start anyone else’s job first. So just because you see 1,000 jobs in the Slurm queue does not mean you can’t use Slurm. Get your jobs in the queue too and Slurm will start them in short order.
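The ordering described above can be illustrated with a toy model. This is a deliberately simplified sketch and not Slurm’s actual priority calculation: the `toy_priority` function, its weights, the user names, and the usage numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    user: str
    hours_waiting: float       # how long the job has been in the queue
    recent_cpu_hours: float    # compute time the user has consumed recently


def toy_priority(job, age_weight=1.0, usage_weight=1.0):
    """Higher is better: waiting raises priority, heavy recent usage lowers it."""
    return age_weight * job.hours_waiting - usage_weight * job.recent_cpu_hours


def start_order(queue):
    """Return pending jobs in the order this toy scheduler would start them."""
    return sorted(queue, key=toy_priority, reverse=True)


# Made-up queue: two jobs from someone who has used a lot of compute time
# recently, and one newly submitted job from a light user.
queue = [
    PendingJob("heavy_user", hours_waiting=3.0, recent_cpu_hours=5000.0),
    PendingJob("heavy_user", hours_waiting=3.0, recent_cpu_hours=5000.0),
    PendingJob("you", hours_waiting=0.1, recent_cpu_hours=10.0),
]

for job in start_order(queue):
    print(job.user, round(toy_priority(job), 1))
```

Even though the light user’s job was submitted last, it sorts to the front because the other user’s recent usage outweighs the time their jobs have spent waiting.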

On the other hand, that may be a good time to take advantage of partitions other than the default “sscc” partition, like “short” (includes servers reserved for jobs that will run for less than six hours) or “econ-grad” and “econ-fac” (servers where Economics grad students or faculty have priority because the Economics Department paid for part of the cost). The Slurm Partitions and Priorities section of the Guide to Research Computing at the SSCC will tell you more.
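As a rough illustration of choosing a partition, the sketch below builds a submission command using sbatch’s standard `--partition` and `--time` options. The helper functions and the script name `run_model.sh` are hypothetical, and whether the SSCC expects a `--time` limit on the “short” partition is an assumption; the Guide to Research Computing at the SSCC is the authority on how to submit jobs here.

```python
import shlex

SHORT_LIMIT_HOURS = 6  # "short" is for jobs that run less than six hours (from the text above)


def pick_partition(max_hours):
    """Choose "short" for jobs under the six-hour limit, else the default "sscc"."""
    return "short" if max_hours < SHORT_LIMIT_HOURS else "sscc"


def sbatch_command(script, max_hours):
    """Build an sbatch command line with a partition and an HH:MM:SS time limit."""
    total_minutes = round(max_hours * 60)
    hours, minutes = divmod(total_minutes, 60)
    return [
        "sbatch",
        f"--partition={pick_partition(max_hours)}",
        f"--time={hours:02d}:{minutes:02d}:00",
        script,
    ]


# "run_model.sh" is a hypothetical job script name.
print(shlex.join(sbatch_command("run_model.sh", max_hours=2.5)))
```

Here `max_hours` should be a comfortable upper bound on the run time rather than a best guess, since Slurm stops jobs that exceed their time limit.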

Using AI to Write Code

All UW-Madison faculty, staff, and students now have access to Microsoft’s Copilot AI, including GPT-4. One of the more intriguing uses for large language models like Copilot or GPT is writing code, and the SSCC’s statistical consultants have been experimenting with it. The results are in many ways impressive: it’s quite good at understanding complex questions, especially if you include specific examples, and the code it generates is mostly right. The trouble is, code that is mostly right usually produces results that are completely wrong. Any code generated by AI needs to be checked carefully and often corrected.
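One practical way to check generated code is to run it on a tiny example whose answer you can work out by hand. The sketch below is illustrative only: both functions are hypothetical, and the “plausible” version stands in for the kind of nearly correct code an AI assistant might return.

```python
def mean_plausible(values):
    """Looks reasonable, but silently counts missing values (None) as zeros."""
    total = sum(v if v is not None else 0 for v in values)
    return total / len(values)


def mean_checked(values):
    """Average only the observed values; refuse to guess if nothing was observed."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    return sum(observed) / len(observed)


# Hand-checkable case: the observed values 10 and 20 average to 15.
sample = [10, None, 20]
print(mean_plausible(sample))  # 10.0: runs without error, but is not the mean of the observed data
print(mean_checked(sample))    # 15.0
```

The first function never raises an error, which is exactly why this kind of mistake goes unnoticed unless you compare the output against a known answer.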

You cannot use Copilot as an alternative to learning how to code. Using code you don’t understand will frequently lead to frustration, as you encounter problems you don’t know how to solve. The biggest problem is that you can’t be sure the code is actually doing what you want and giving you the correct results. (This is true whether the code came from AI or from the web.) Researchers who want to work with data should learn a language like R, Stata, or Python and how to work with data (“data wrangling”) in that language.

While you could ask Copilot to teach you to code, we humbly suggest SSCC’s training or Statistical Computing Knowledge Base instead. Any learning materials written by a knowledgeable human will probably be better written and more trustworthy than what Copilot can produce. On the other hand, if you’re working through online material on your own, you can ask Copilot questions and use it as a virtual instructor.

Asking Copilot to write code you could easily write may or may not save you time, as understanding and checking the code it gives you may take just as long as writing it yourself.

Copilot can really help you out if you know how to code in general but aren’t sure how to carry out a particular task. If you explain the task in some detail, the code Copilot gives you will probably be either correct or easily fixable, and making the code work will probably be faster than figuring out the task yourself. And you will learn from Copilot’s solution.

We encourage you to take advantage of Copilot as you’re writing code. Just be aware of its limitations and use it for what it’s good at. In the end, you are responsible for the code that generates your results no matter how it was written.

Computer Summer Cleaning

With summer in full swing, now is a great time to do a quick review of your computer and see if any changes need to be made. For example:

  • Review the software installed on your machine to see if it is still needed.
  • Go through your documents and other files to see if you need to reorganize or remove files you no longer need.
  • If the SSCC does not manage your computer, make sure its operating system and installed software are up to date to help keep it safe and secure.