Linux essentials for DS/ML
Stay true
#️⃣  Tools ⌛  ~50 min 🗿  Beginner
21.07.2022
#6

Operating systems · Shell · Scripting · Bash · Zsh · Vim · Neovim · Automation · Workflows · Data pipelines · SSH


🎓 2/167

This post is part of the Essentials educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order shown in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!

Passive-aggressive system debugging


Linux has become an indispensable tool in the arsenal of data scientists and machine learning engineers. Why? Because it offers unparalleled control, flexibility, and efficiency when dealing with high-performance computing tasks, massive datasets, and custom workflows. When you're training complex models, running data pipelines, or deploying machine learning systems at scale, Linux gives you the power to optimize every layer of your workflow. Whether it's handling multi-threaded computations or tweaking system-level configurations to squeeze out every bit of performance, Linux provides the freedom to do so.

Unlike other operating systems, Linux excels in its native compatibility with virtually every tool and framework you'll use, from TensorFlow and PyTorch to distributed computing systems like Apache Spark. Open-source ecosystems thrive here, giving you access to cutting-edge research and community support. Not to mention, the server environments you'll deploy your models to — whether in AWS, Google Cloud, or Azure — run on Linux, so having fluency in this OS is essential.

Why Manjaro? For this blog, we'll use Manjaro Linux as our example distribution. Built on Arch Linux, Manjaro provides the best of both worlds: a cutting-edge, rolling-release system combined with user-friendly tools and stability. Manjaro's package manager, pacman, along with access to the AUR (Arch User Repository), ensures that nearly every library, tool, or dependency you need is readily available. Plus, its minimalist philosophy makes it an excellent playground for learning Linux without unnecessary distractions.


Command line

The command line is the beating heart of Linux. While graphical interfaces exist, they're often bypassed by professionals for the raw power and automation capabilities of the command line. Mastering it is akin to learning a new programming language — initially challenging but infinitely rewarding.

File system hierarchy

Linux organizes its files into a structured tree known as the file system hierarchy. Here are some key directories you'll interact with frequently:

  • /etc: Configuration files for the system and installed software. For example, if you need to tweak the behavior of your Python installation or manage system-wide environment variables, this is where you'll look.
  • /var: Stores variable data such as logs, caches, and runtime files. Logs generated by tools like cron or systemd are invaluable for debugging.
  • /opt: Optional third-party software. If you're running custom ML tools that aren't in the standard repositories, they might live here.
  • /usr/local: User-installed software. When you compile and install from source, this is the default destination.

Understanding this structure lets you intuitively locate configuration files, log outputs, and software installations.
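
To get a quick feel for these locations, you can simply list them (a harmless, read-only sketch; the exact contents will differ from machine to machine):

ls -ld /etc /var/log /opt /usr/local
ls /var/log | head    # a sample of the log files present on this system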

Basic commands

While the Linux command line is vast, let's start with a core set of commands you'll use daily:

  • ls: Lists files and directories. Use flags like -l for detailed info or -a to show hidden files.

ls -la /etc
  • cd: Changes the working directory.

cd /usr/local/bin
  • pwd: Prints the current working directory.

pwd
  • cp: Copies files or directories. Use -r for recursive copies.

cp -r /data/old /data/new
  • mv: Moves (or renames) files.

mv model.py old_model.py
  • rm: Removes files or directories. Be cautious with rm -rf to avoid unintentional deletions.

rm -rf /tmp/unnecessary_files
  • cat: Concatenates and displays file content.

cat logs.txt
  • less: Views file content page by page, useful for large logs.

less system.log

Recursive operations

When working with datasets, you'll often need to perform operations across multiple files or directories. This is where find and xargs shine:

  • Find all .csv files and print their paths:

find /data -name "*.csv"
  • Delete all .tmp files (use caution):

find /data -name "*.tmp" | xargs rm
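
If your file names may contain spaces, a safer variant (assuming GNU find and xargs, which support null-delimited records) is:

find /data -name "*.tmp" -print0 | xargs -0 rm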

Batch renaming

Renaming hundreds of files manually is tedious. The Perl-style rename command simplifies this (on Arch-based systems it may ship as a separate package such as perl-rename, while the default util-linux rename uses a simpler old/new syntax):

  • Replace old with new in file names:

rename 's/old/new/' *.txt

Disk usage analysis

When your disk fills up with datasets and checkpoints, disk usage tools become essential:

  • du: Summarizes file and directory sizes. Use -h for human-readable output.

du -h /data
  • df: Shows available disk space on mounted filesystems.

df -h
  • ncdu: A terminal-based disk usage analyzer that's both visual and interactive. Install it with:

sudo pacman -S ncdu
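
If you prefer to stay with plain du, combining it with sort quickly surfaces the largest directories (a sketch; adjust the path and depth to your layout):

du -h --max-depth=1 /data | sort -hr | head -n 10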

Misc

  • Use tab completion to save time typing paths or commands.
  • Leverage history (history command or pressing Ctrl+R) to recall previous commands.
  • Alias frequently used commands in your ~/.bashrc or ~/.zshrc. For example:

alias ll='ls -la'

Bash and Zsh

Switching from bash to zsh on Manjaro is a straightforward process and can significantly enhance your shell experience. While bash is a robust default, zsh introduces advanced features like better autocompletion, customizable prompts, and plugin support.

To switch to zsh, first ensure it is installed:


sudo pacman -S zsh

Then set it as your default shell:


chsh -s /bin/zsh

Restart your terminal to activate the change. You'll immediately notice a difference in the shell's behavior and aesthetics.

Setting up a productive shell environment

One of the most popular tools for enhancing zsh is oh-my-zsh, a framework that makes customization effortless. Install it with:


sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

Once installed, you can activate plugins tailored for data science, such as:

  • git: For seamless Git integration.
  • zsh-autosuggestions: Provides command suggestions as you type.
  • zsh-syntax-highlighting: Adds syntax highlighting for commands.

To enable these, edit your ~/.zshrc (the git plugin ships with oh-my-zsh, while zsh-autosuggestions and zsh-syntax-highlighting are third-party plugins that need to be installed separately first, e.g. cloned into ~/.oh-my-zsh/custom/plugins):


plugins=(git zsh-autosuggestions zsh-syntax-highlighting)

Both ~/.bashrc and ~/.zshrc are startup files where you can define aliases, environment variables, and custom functions. Examples:

  • Add Python virtual environments:

export WORKON_HOME=$HOME/.virtualenvs
source /usr/local/bin/virtualenvwrapper.sh
  • Define shortcuts for common tasks:

alias activate="source venv/bin/activate"
alias jn="jupyter notebook"
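
The same files can hold small helper functions. As a sketch, a hypothetical mkcd that creates a directory and immediately enters it:

mkcd() {
    mkdir -p "$1" && cd "$1"
}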

Command chaining and redirection

Mastering shell command chaining and redirection is crucial for streamlining workflows:

  • Command chaining:
    • &&: Executes the second command only if the first succeeds.

mkdir new_folder && cd new_folder
    • ||: Executes the second command only if the first fails.

mkdir existing_folder || echo "Folder already exists."
    • ;: Executes commands sequentially, regardless of success or failure.

echo "Hello"; echo "World"
  • Piping (|): Sends the output of one command as input to another.

cat large_file.txt | grep "keyword"
  • Redirection:
    • >: Overwrites a file with command output.

echo "Data" > file.txt
    • >>: Appends output to a file.

echo "More Data" >> file.txt
    • <: Reads input from a file.

wc -l < file.txt
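
These pieces combine naturally. A pattern you'll use constantly for long-running jobs is capturing both stdout and stderr in a log file with 2>&1, which points standard error at the same place as standard output (train.py here is just a placeholder script name):

python train.py > training.log 2>&1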

Writing efficient one-liners

Shell one-liners are powerful tools for quick data manipulations. Here are a few examples:

  • Count unique words in a file:

cat file.txt | tr ' ' '\n' | sort | uniq -c
  • Extract the first column of a CSV:

cut -d',' -f1 data.csv
  • Monitor GPU usage (NVIDIA):

watch -n 1 nvidia-smi

Vim and Neovim

For quick edits or comprehensive development tasks, mastering vim or neovim (nvim) is a game-changer.

Basic commands

Vim is modal, meaning it operates in different modes:

  • Insert Mode: For editing text (i to enter, Esc to exit).
  • Normal Mode: For navigating and executing commands.
  • Command Mode: For running commands (:).

Some essential commands:

  • Save and quit:

:wq
  • Delete a line:

dd
  • Undo and redo:

u (undo), Ctrl+r (redo)
  • Search for text:

/keyword

Neovim enhances Vim with modern features like asynchronous plugins and an embedded Lua scripting interface. Install it on Manjaro with:


sudo pacman -S neovim

Configuring Vim/Neovim for Python

  1. Install vim-plug for plugin management:

curl -fLo ~/.local/share/nvim/site/autoload/plug.vim --create-dirs https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim
  2. Add Python-specific plugins to ~/.config/nvim/init.vim:

call plug#begin('~/.vim/plugged')
Plug 'dense-analysis/ale'        " Linting
Plug 'vim-python/python-syntax' " Python syntax
call plug#end()
  3. Configure Jupyter Notebook support with plugins like vim-slime.

Misc

  • Macros: Record a sequence of commands into a register with q (e.g., qa ... q), then replay it with @ (e.g., @a).
  • Search and Replace:

:%s/old/new/g
  • Navigation with Markers: Use m followed by a letter to mark a location (e.g., ma), then jump back to it with ' (e.g., 'a).

Shell scripting

Shell scripting is an indispensable skill for data scientists and machine learning engineers who work with Linux. It bridges the gap between manual work and automated workflows, saving countless hours and reducing errors. A shell script is essentially a series of commands written in a text file, executed by a shell interpreter like bash or zsh. Let's break down some key components and practical examples.

Writing simple scripts

A shell script starts with a shebang (#!) followed by the path to the shell interpreter:


#!/bin/bash
# This is a simple shell script

echo "Hello!"

Save this in a file, say hello.sh, and make it executable:


chmod +x hello.sh
./hello.sh

The script outputs:


Hello!

Positional parameters and flags

Positional parameters allow you to pass arguments to your script:


#!/bin/bash
# Script to greet users

name=$1
role=$2
echo "Hello, $name! You are a $role."

Running this script as ./greet.sh Alice Engineer produces:


Hello, Alice! You are a Engineer.

For more robust scripts, use flags and the getopts utility:


#!/bin/bash
# Script with flags

while getopts n:r: flag
do
    case "${flag}" in
        n) name=${OPTARG};;
        r) role=${OPTARG};;
    esac
done

echo "Hello, $name! You are a $role."

Run it as:


./greet.sh -n Alice -r Engineer
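
As a slightly more defensive sketch (one possible convention, not the only one), you can provide defaults and a usage message for unrecognized flags:

#!/bin/bash
# Greeting script with flags, defaults, and a usage message

name="stranger"   # defaults used when flags are omitted
role="guest"

while getopts n:r: flag
do
    case "${flag}" in
        n) name=${OPTARG};;
        r) role=${OPTARG};;
        *) echo "Usage: $0 [-n name] [-r role]" >&2; exit 1;;
    esac
done

echo "Hello, $name! You are a $role."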

Conditionals and loops for data pipeline tasks

Conditional statements and loops are critical for building dynamic pipelines. For instance, to preprocess multiple datasets:


#!/bin/bash
# Preprocessing multiple datasets

for file in data/*.csv
do
    if [[ -f "$file" ]]; then
        echo "Processing $file"
        # Example: converting to lowercase
        awk '{ print tolower($0) }' "$file" > "${file%.csv}_processed.csv"
    fi
done

Automating data preprocessing and scheduling

A common task is scheduling a preprocessing script to run daily. First, write your script (e.g., preprocess.sh) and add a cron job:


crontab -e

Add the following line to run the script at midnight:


0 0 * * * /path/to/preprocess.sh

Use cron to schedule model training or data sync jobs, ensuring your pipelines stay up-to-date without manual intervention.
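
Because cron jobs run non-interactively, it's worth redirecting their output somewhere you can inspect later (the log path below is only an example):

0 0 * * * /path/to/preprocess.sh >> $HOME/logs/preprocess.log 2>&1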

Organizing file systems for datasets

Shell scripting is excellent for organizing datasets:


#!/bin/bash
# Organizing datasets

mkdir -p datasets/{raw,processed}
mv *.csv datasets/raw/

This creates a directory structure and moves files accordingly, ensuring your project's file system is clean and logical.

Automating environment setup

Data science projects often require setting up environments with dependencies. Automate this with a shell script:


#!/bin/bash
# Setting up environment

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
echo "Environment setup complete."

Debugging and optimizing scripts

Debugging scripts is simplified with the -x flag:


bash -x your_script.sh

This prints each command before execution, allowing you to trace issues. To optimize, avoid unnecessary loops and leverage efficient utilities like awk or parallel.
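
Inside the script itself, you can also opt into bash's stricter behavior so failures surface early instead of being silently ignored (a common convention, not a requirement):

#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and failed pipeline stages
set -x              # trace each command as it runs; drop this line once debugged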

Integrating scripts with Python

Combine the power of shell scripting and Python for seamless workflows. For example:


#!/bin/bash
# Calling Python from shell script

python <<EOF
import os
print("Running Python code from shell script")
EOF

This enables you to leverage Python's rich ecosystem within a Linux-based automation pipeline.


Command line utils for data scientists

Linux offers a treasure trove of command-line utilities that simplify data processing, performance monitoring, and file management. Let's explore some indispensable tools.

Data processing tools

  • grep: Extract patterns using regular expressions. For instance, finding rows with specific keywords in a dataset:

grep -E "error|failure" logs.txt > filtered_logs.txt
  • awk: A powerful tool for extracting and transforming data. To extract the second column of a CSV file:

awk -F, '{ print $2 }' dataset.csv > column2.txt

Perform transformations inline:


awk -F, '{ OFS=","; $3=$3*100; print }' dataset.csv > updated_dataset.csv
  • sed: Edit files in place. Replace all instances of "old" with "new":

sed -i 's/old/new/g' file.txt
  • sort and uniq: Deduplicate and sort data. For example, count unique values in a column:

awk -F, '{ print $1 }' dataset.csv | sort | uniq -c

Performance monitoring and process management

  • top/htop: Monitor system resource usage. While top is minimalistic, htop offers an interactive interface with sortable columns and process management features.

  • nmon: Provides in-depth system performance metrics, including CPU, memory, disk I/O, and network usage. Useful for debugging performance bottlenecks in training.

  • ps: Lists running processes. To find a specific process:


ps aux | grep python
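
To keep an eye on a long-running job, you can combine watch with ps (a sketch assuming the process of interest is a python command):

watch -n 2 'ps -o pid,%cpu,%mem,etime,cmd -C python'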

File compression and transfer

  • tar and gzip: Compress datasets:

tar -czvf dataset.tar.gz datasets/
  • scp: Transfer files securely:

scp dataset.tar.gz user@remote_host:/path/to/destination
  • rsync: Synchronize directories efficiently:

rsync -avz datasets/ user@remote_host:/backup/datasets/

This tool is especially valuable for backing up or syncing large datasets across servers.

Example: combining utils for workflow automation

Imagine a scenario where you preprocess a dataset, compress it, and transfer it to a remote server. You can script the entire process:


#!/bin/bash
# Data processing pipeline

grep -E "pattern" raw_data.txt > filtered_data.txt
awk -F, '{ print $2 }' filtered_data.txt > column2.txt
sort column2.txt | uniq > unique_values.txt

tar -czvf processed_data.tar.gz unique_values.txt
scp processed_data.tar.gz user@remote_host:/data/

Mastering these tools elevates your ability to manage data science workflows efficiently in a Linux environment.


Remote workflows

Remote workflows are at the heart of many data scientists' and ML engineers' daily routines. Whether you're connecting to a high-performance compute cluster, managing a remote server for experiments, or syncing files across systems, SSH (Secure Shell) is an essential tool.

Basics of SSH

At its core, SSH provides a secure channel over an unsecured network. When you execute:


ssh user@remote-server

you're establishing an encrypted connection to the remote server under the specified username. The server's identity is verified via its host key, and your access is authenticated using a password or a private key.

Setting up and configuring keys

SSH keys streamline the authentication process by replacing passwords with cryptographic key pairs:

  1. Generate a key pair on your local machine:

ssh-keygen -t ed25519 -C "your_email@example.com"

This generates two files: ~/.ssh/id_ed25519 (private key) and ~/.ssh/id_ed25519.pub (public key).

  2. Copy the public key to the remote server:

ssh-copy-id user@remote-server

This appends your public key to ~/.ssh/authorized_keys on the server.

  3. Test the connection:

ssh user@remote-server

If successful, no password will be required.
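
If you protected the private key with a passphrase, an ssh-agent session saves you from retyping it on every connection (standard OpenSSH tooling):

eval "$(ssh-agent -s)"      # start the agent for this shell session
ssh-add ~/.ssh/id_ed25519   # cache the decrypted key in the agent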

Managing multiple SSH connections

When juggling multiple servers, managing connection details manually becomes tedious. Enter the ~/.ssh/config file:


Host myserver
    HostName remote-server
    User user
    IdentityFile ~/.ssh/id_ed25519
    Port 22

With this configuration, connecting is as simple as:


ssh myserver

This setup supports aliases, alternative ports, and different key files for various servers, greatly simplifying workflows.

Port forwarding

Port forwarding enables you to securely access services running on a remote server. For instance, to run a Jupyter Notebook:

  1. Start the notebook on the remote server:

jupyter notebook --no-browser --port=8889
  2. Forward the port to your local machine:

ssh -L 8888:localhost:8889 user@remote-server
  3. Open http://localhost:8888 in your browser to access the notebook.

File transfers and syncing

Transferring files between systems is a common task. The scp command provides a straightforward method:


scp local_file user@remote-server:/path/to/destination

For more advanced use cases, like syncing directories, rsync is invaluable:


rsync -avh local_directory/ user@remote-server:/remote_directory/

This command preserves metadata and only transfers modified files, making it efficient for large datasets.

Tunneling and port forwarding for Jupyter notebooks

SSH tunneling securely forwards traffic between systems. To create a tunnel for JupyterLab:

  1. Open the tunnel:

ssh -L 8080:localhost:8888 user@remote-server
  2. Open your browser and navigate to http://localhost:8080.

This approach ensures your local machine securely communicates with the remote server without exposing sensitive ports.


Package management and environment setup

Linux's package managers provide a robust ecosystem for managing software, libraries, and dependencies. For ML and data science, this ensures consistency and efficiency.

Pacman and AUR

On Arch Linux and its derivatives, pacman is the default package manager. It handles system updates and installs with speed and simplicity:


sudo pacman -Syu  # Synchronize and update system

The Arch User Repository (AUR) extends pacman with community-maintained packages, crucial for accessing niche libraries and tools.

Use pacman to search for packages:


pacman -Ss python

To install:


sudo pacman -S python

To remove unused packages:


sudo pacman -Rns package_name

To clean the package cache:


sudo paccache -r
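
Two query operations worth memorizing are inspecting an installed package and finding out which package owns a given file:

pacman -Qi python             # detailed info about an installed package
pacman -Qo /usr/bin/python    # which package owns this file?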

Paru, an AUR helper

Tools like paru simplify managing AUR packages. To install paru itself:


sudo pacman -S --needed base-devel git   # build prerequisites for makepkg
git clone https://aur.archlinux.org/paru.git
cd paru
makepkg -si

Then, install AUR packages:


paru -S package_name

Setting up Python environments

Python's flexibility is both a blessing and a curse. Proper environment management prevents dependency conflicts:

  • venv: Built into Python, it's ideal for lightweight projects:

python -m venv myenv
source myenv/bin/activate
pip install numpy pandas
  • Conda: For larger, more complex environments:

conda create -n myenv python=3.9 numpy pandas
conda activate myenv

Conda handles non-Python dependencies, making it indispensable for ML workflows.
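
Whichever tool you pick, snapshot the environment so it can be recreated elsewhere (the file names below are just the usual conventions):

pip freeze > requirements.txt          # for venv/pip environments
conda env export > environment.yml     # for conda environments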

System-wide and user-specific installations

Use pip install --user to avoid system-wide conflicts:


pip install --user scikit-learn

In general, avoid installing packages system-wide with pip; keep each project's dependencies isolated in its own virtual environment.

GPU setup for deep learning

Verifying GPU drivers and CUDA setup is crucial:

  1. Check GPU availability:

nvidia-smi
  2. Install drivers:

sudo pacman -S nvidia
  3. Install CUDA:

sudo pacman -S cuda
  4. Verify CUDA installation:

nvcc --version
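
As a final sanity check, confirm that your framework actually sees the GPU; assuming PyTorch is installed, for example:

python -c "import torch; print(torch.cuda.is_available())"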

With these tools configured, your Linux system is primed for efficient data science and machine learning workflows.


Filesystem permissions and security

Working with filesystem permissions in Linux is essential for managing data securely and effectively. Permissions are foundational for protecting sensitive data and ensuring that processes and users operate within the intended scope. Linux's permission model revolves around three entities: the user, the group, and others. Each file and directory has a permission set that dictates what each entity can read (r), write (w), and execute (x).

Controlling file permissions

The chmod command is the go-to tool for changing permissions. Permissions are represented either symbolically (e.g., rwx) or numerically (e.g., 755). The numeric representation uses octal notation:

  • 4: Read
  • 2: Write
  • 1: Execute

Combine these values to set permissions. For example, chmod 750 my_script.sh grants full permissions to the owner (7 = 4+2+1), read and execute to the group (5 = 4+1), and no permissions to others (0).
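
You can verify the result with ls -l, which prints the permission bits in symbolic form (the leading dash marks a regular file; owner and group names will differ on your system):

chmod 750 my_script.sh
ls -l my_script.sh    # -rwxr-x--- 1 owner group ... my_script.sh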

Symbolic modifications provide more granular control:


chmod u+x my_script.sh   # Adds execute permission to the user.
chmod g-w my_script.sh   # Removes write permission from the group.
chmod o=r my_script.sh   # Sets read-only for others.

Changing ownership

Ownership in Linux defines who controls a file. The chown command allows you to change the user and/or group owner:


chown alice:developers my_data.csv

Here, Alice becomes the owner, and the developers group is assigned. Changing ownership can be extended recursively with the -R flag to affect entire directories:


chown -R bob:researchers /project

Default permission mask

The umask command sets default permissions for newly created files and directories. A common default umask value is 022, which removes write permissions for group and others:


umask 022

Files are created starting from a base mode of 666 (read/write for all) with the umask bits removed; directories start from 777. To ensure new files are private:


umask 077
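
A quick check that this behaves as expected (using a throwaway file):

touch secret.txt
ls -l secret.txt    # -rw------- : readable and writable by the owner only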

Managing user groups and privileges

Linux groups allow you to manage access rights for multiple users collectively.

  • groupadd: Creates new groups:

groupadd analysts
  • usermod: Adds users to groups or modifies existing ones:

usermod -aG analysts charlie  # Adds Charlie to the analysts group.

List user groups with:


groups charlie

Using sudo effectively

The sudo command allows users to execute commands with elevated privileges. Best practices include:

  • Minimal permissions: Grant only the required privileges by editing the /etc/sudoers file or using visudo for safety.
  • Avoid direct root login: This ensures better auditing and reduces risks.
  • Time-limited sudo: Some organizations use sudo configurations that require re-authentication after a timeout.

Protecting sensitive files

Sensitive files, such as API keys or private datasets, can be encrypted using gpg (GNU Privacy Guard). Encryption ensures that even if someone gains access to the file, they cannot read it without the decryption key.

Encrypt a file:


gpg -c secrets.txt  # Prompts for a passphrase and encrypts the file.

Decrypt a file:


gpg secrets.txt.gpg

Advanced users can use asymmetric encryption for sharing secrets securely:


gpg --encrypt --recipient alice@example.com shared_data.csv

Troubleshooting and system maintenance

When things go wrong on a Linux system — and they inevitably will — knowing how to troubleshoot effectively can save hours or days of frustration. For ML workloads, this can mean recovering lost compute time or diagnosing performance bottlenecks.

Interpreting Linux logs

Linux logs are invaluable for debugging issues. The journalctl command interacts with the systemd journal and provides an intuitive way to view logs:


journalctl -u sshd.service   # Logs for a specific service (substitute the unit you care about).
journalctl --since "2 hours ago"   # Logs from the past 2 hours.

Older, plain-text log files are stored in /var/log. Depending on your distribution and logging setup, key logs include:

  • /var/log/syslog: General system messages.
  • /var/log/auth.log: Authentication and sudo-related logs.
  • /var/log/kern.log: Kernel-related messages.

Recovering from boot errors

Boot errors can cripple productivity. Using a live USB or recovery disk, you can mount the root filesystem and chroot into it:


mount /dev/sda1 /mnt            # adjust the device to your root partition
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt

From here, you can repair configurations, reinstall packages, or reconfigure the bootloader. On Arch-based systems, arch-chroot /mnt sets up these bind mounts for you.

Monitoring disk health and performance

Monitoring disk health is critical for ensuring data integrity, especially when handling large datasets. Use smartctl from the smartmontools package:


smartctl -a /dev/sda   # View SMART health information.

Disk performance can be analyzed using iostat:


iostat -x 1

This provides metrics like disk utilization, read/write throughput, and queue lengths.

Managing out-of-memory errors

Out-of-memory (OOM) errors are common when running large ML workloads. The Linux kernel's OOM-killer terminates processes when memory runs low. To avoid this:

  • Monitor memory usage: Tools like htop and free provide real-time memory stats.
  • Create swap files: Swap acts as overflow memory when RAM is full:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
  • Tune OOM behavior: Adjust the oom_score_adj to deprioritize killing critical processes:

echo -1000 | sudo tee /proc/$(pidof my_ml_process)/oom_score_adj