

🎓 2/167
This post is a part of the Essentials educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; in Research, the order can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
Passive-aggressive system debugging
Linux has become an indispensable tool in the arsenal of data scientists and machine learning engineers. Why? Because it offers unparalleled control, flexibility, and efficiency when dealing with high-performance computing tasks, massive datasets, and custom workflows. When you're training complex models, running data pipelines, or deploying machine learning systems at scale, Linux gives you the power to optimize every layer of your workflow. Whether it's handling multi-threaded computations or tweaking system-level configurations to squeeze out every bit of performance, Linux provides the freedom to do so.
Unlike other operating systems, Linux excels in its native compatibility with virtually every tool and framework you'll use, from TensorFlow and PyTorch to distributed computing systems like Apache Spark. Open-source ecosystems thrive here, giving you access to cutting-edge research and community support. Not to mention, the server environments you'll deploy your models to — whether in AWS, Google Cloud, or Azure — run on Linux, so having fluency in this OS is essential.
Why Manjaro? For this blog, we'll use Manjaro Linux as our example distribution. Built on Arch Linux, Manjaro provides the best of both worlds: a cutting-edge, rolling-release system combined with user-friendly tools and stability. Manjaro's package manager, pacman, together with the AUR (Arch User Repository), ensures that every library, tool, or dependency you need is readily available. Plus, its minimalist philosophy makes it an excellent playground for learning Linux without unnecessary distractions.
Command line
The command line is the beating heart of Linux. While graphical interfaces exist, they're often bypassed by professionals for the raw power and automation capabilities of the command line. Mastering it is akin to learning a new programming language — initially challenging but infinitely rewarding.
File system hierarchy
Linux organizes its files into a structured tree known as the file system hierarchy. Here are some key directories you'll interact with frequently:
- /etc: Configuration files for the system and installed software. For example, if you need to tweak the behavior of your Python installation or manage system-wide environment variables, this is where you'll look.
- /var: Stores variable data such as logs, caches, and runtime files. Logs generated by tools like cron or systemd are invaluable for debugging.
- /opt: Optional third-party software. If you're running custom ML tools that aren't in the standard repositories, they might live here.
- /usr/local: User-installed software. When you compile and install from source, this is the default destination.
Understanding this structure lets you intuitively locate configuration files, log outputs, and software installations.
Basic commands
While the Linux command line is vast, let's start with a core set of commands you'll use daily:
- ls: Lists files and directories. Use flags like -l for detailed info or -a to show hidden files.
ls -la /etc
- cd: Changes the working directory.
cd /usr/local/bin
- pwd: Prints the current working directory.
pwd
- cp: Copies files or directories. Use -r for recursive copies.
cp -r /data/old /data/new
- mv: Moves (or renames) files.
mv model.py old_model.py
- rm: Removes files or directories. Be cautious with rm -rf to avoid unintentional deletions.
rm -rf /tmp/unnecessary_files
- cat: Concatenates and displays file content.
cat logs.txt
- less: Views file content page by page, useful for large logs.
less system.log
Recursive operations
When working with datasets, you'll often need to perform operations across multiple files or directories. This is where find and xargs shine:
- Find all .csv files and print their paths:
find /data -name "*.csv"
- Delete all .tmp files (use caution):
find /data -name "*.tmp" | xargs rm
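If filenames may contain spaces or newlines, a null-delimited pipeline is the safer variant; a minimal sketch (the /data path and *.tmp pattern are placeholders):
find /data -name "*.tmp" -print0 | xargs -0 rm --
# or let find delete the matches itself:
find /data -name "*.tmp" -delete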
Batch renaming
Renaming hundreds of files manually is tedious. The rename command simplifies this:
- Replace old with new in file names:
rename 's/old/new/' *.txt
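Note that on Arch-based systems the Perl-style s/old/new/ syntax typically comes from the perl-rename package, while the default util-linux rename uses a simpler "rename old new *.txt" form. If neither is handy, a plain bash loop does the same job; a minimal sketch:
for f in *.txt; do mv -- "$f" "${f/old/new}"; done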
Disk usage analysis
When your disk fills up with datasets and checkpoints, disk usage tools become essential:
- du: Summarizes file and directory sizes. Use -h for human-readable output.
du -h /data
- df: Shows available disk space on mounted filesystems.
df -h
- ncdu: A terminal-based disk usage analyzer that's both visual and interactive. Install it with:
sudo pacman -S ncdu
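Once installed, point it at a directory and browse interactively (the path is just an example):
ncdu /data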
Misc
- Use tab completion to save time typing paths or commands.
- Leverage history (the history command or Ctrl+R) to recall previous commands.
- Alias frequently used commands in your ~/.bashrc or ~/.zshrc. For example:
alias ll='ls -la'
Bash and Zsh
Switching from bash to zsh on Manjaro is a straightforward process and can significantly enhance your shell experience. While bash is a robust default, zsh introduces advanced features like better autocompletion, customizable prompts, and plugin support.
To switch to zsh, first ensure it is installed:
sudo pacman -S zsh
Then set it as your default shell:
chsh -s /bin/zsh
Restart your terminal to activate the change. You'll immediately notice a difference in the shell's behavior and aesthetics.
Setting up a productive shell environment
One of the most popular tools for enhancing zsh is oh-my-zsh, a framework that makes customization effortless. Install it with:
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Once installed, you can activate plugins tailored for data science, such as:
- git: For seamless Git integration.
- zsh-autosuggestions: Provides command suggestions as you type.
- zsh-syntax-highlighting: Adds syntax highlighting for commands.
To enable these, edit your ~/.zshrc:
plugins=(git zsh-autosuggestions zsh-syntax-highlighting)
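Note that zsh-autosuggestions and zsh-syntax-highlighting are community plugins and don't ship with oh-my-zsh itself; one common way to get them (assuming the default custom-plugins location) is to clone them from the zsh-users repositories:
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
git clone https://github.com/zsh-users/zsh-syntax-highlighting ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting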
Both ~/.bashrc and ~/.zshrc are startup files where you can define aliases, environment variables, and custom functions. Examples:
- Set up Python virtual environments (here via virtualenvwrapper):
export WORKON_HOME=$HOME/.virtualenvs
source /usr/local/bin/virtualenvwrapper.sh
- Define shortcuts for common tasks:
alias activate="source venv/bin/activate"
alias jn="jupyter notebook"
Command chaining and redirection
Mastering shell command chaining and redirection is crucial for streamlining workflows:
- Command chaining:
&&: Executes the second command only if the first succeeds.
mkdir new_folder && cd new_folder
||: Executes the second command only if the first fails.
mkdir existing_folder || echo "Folder already exists."
;: Executes commands sequentially, regardless of success or failure.
echo "Hello"; echo "World"
- Piping (|): Sends the output of one command as input to another.
cat large_file.txt | grep "keyword"
- Redirection:
>: Overwrites a file with command output.
echo "Data" > file.txt
>>: Appends output to a file.
echo "More Data" >> file.txt
<: Reads input from a file.
wc -l < file.txt
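Redirecting stderr along with stdout is also worth knowing, for example when capturing training logs; a small sketch (the script name is a placeholder):
python train.py > train.log 2>&1     # both stdout and stderr go to train.log
python train.py >> train.log 2>&1 &  # append to the log and run in the background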
Writing efficient one-liners
Shell one-liners are powerful tools for quick data manipulations. Here are a few examples:
- Count unique words in a file:
cat file.txt | tr ' ' '\n' | sort | uniq -c
- Extract the first column of a CSV:
cut -d',' -f1 data.csv
- Monitor GPU usage (NVIDIA):
watch -n 1 nvidia-smi
Vim and Neovim
For quick edits or comprehensive development tasks, mastering vim or neovim (nvim) is a game-changer.
Basic commands
Vim is modal, meaning it operates in different modes:
- Insert Mode: For editing text (i to enter, Esc to exit).
- Normal Mode: For navigating and executing commands.
- Command Mode: For running commands (:).
Some essential commands:
- Save and quit:
:wq
- Delete a line:
dd
- Undo and redo:
u (undo), Ctrl+r (redo)
- Search for text:
/keyword
Neovim enhances Vim with modern features like asynchronous plugins and an embedded Lua scripting interface. Install it on Manjaro with:
sudo pacman -S neovim
Configuring Vim/Neovim for Python
- Install vim-plug for plugin management:
curl -fLo ~/.local/share/nvim/site/autoload/plug.vim --create-dirs https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim
- Add Python-specific plugins to ~/.config/nvim/init.vim:
call plug#begin('~/.vim/plugged')
Plug 'dense-analysis/ale' " Linting
Plug 'vim-python/python-syntax' " Python syntax
call plug#end()
- Configure Jupyter Notebook support with plugins like vim-slime.
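After editing init.vim, the declared plugins still need to be fetched. With vim-plug this is done by its :PlugInstall command; one common pattern is to run it headlessly from the shell:
nvim +PlugInstall +qall   # install the declared plugins, then quit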
Misc
- Macros: Record a sequence of commands with q, execute them with @.
- Search and Replace: :%s/old/new/g
- Navigation with marks: Use m to set a mark, jump back to it with '.
Shell scripting
Shell scripting is an indispensable skill for data scientists and machine learning engineers who work with Linux. It bridges the gap between manual work and automated workflows, saving countless hours and reducing errors. A shell script is essentially a series of commands written in a text file, executed by a shell interpreter like bash or zsh. Let's break down some key components and practical examples.
Writing simple scripts
A shell script starts with a shebang (#!) followed by the path to the shell interpreter:
#!/bin/bash
# This is a simple shell script
echo "Hello!"
Save this in a file, say hello.sh
, and make it executable:
chmod +x hello.sh
./hello.sh
The script outputs:
Hello!
Positional parameters and flags
Positional parameters allow you to pass arguments to your script:
#!/bin/bash
# Script to greet users
name=$1
role=$2
echo "Hello, $name! You are a $role."
Running this script as ./greet.sh Alice Engineer
produces:
Hello, Alice! You are a Engineer.
For more robust scripts, use flags and the getopts utility:
#!/bin/bash
# Script with flags
while getopts n:r: flag
do
    case "${flag}" in
        n) name=${OPTARG};;
        r) role=${OPTARG};;
    esac
done
echo "Hello, $name! You are a $role."
Run it as:
./greet.sh -n Alice -r Engineer
Conditionals and loops for data pipeline tasks
Conditional statements and loops are critical for building dynamic pipelines. For instance, to preprocess multiple datasets:
#!/bin/bash
# Preprocessing multiple datasets
for file in data/*.csv
do
if [[ -f "$file" ]]; then
echo "Processing $file"
# Example: converting to lowercase
awk '{ print tolower($0) }' "$file" > "${file%.csv}_processed.csv"
fi
done
Automating data preprocessing and scheduling
A common task is scheduling a preprocessing script to run daily. First, write your script (e.g., preprocess.sh) and add a cron job:
crontab -e
Add the following line to run the script at midnight:
0 0 * * * /path/to/preprocess.sh
Use cron to schedule model training or data sync jobs, ensuring your pipelines stay up-to-date without manual intervention.
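Since cron jobs run without a terminal, it's worth capturing their output somewhere; a small sketch (the log path is a placeholder):
0 0 * * * /path/to/preprocess.sh >> /home/user/preprocess.log 2>&1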
Organizing file systems for datasets
Shell scripting is excellent for organizing datasets:
#!/bin/bash
# Organizing datasets
mkdir -p datasets/{raw,processed}
mv *.csv datasets/raw/
This creates a directory structure and moves files accordingly, ensuring your project's file system is clean and logical.
Automating environment setup
Data science projects often require setting up environments with dependencies. Automate this with a shell script:
#!/bin/bash
# Setting up environment
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
echo "Environment setup complete."
Debugging and optimizing scripts
Debugging scripts is simplified with the -x flag:
bash -x your_script.sh
This prints each command before execution, allowing you to trace issues. To optimize, avoid unnecessary loops and leverage efficient utilities like awk or parallel.
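For example, GNU parallel (available via pacman) can spread independent file operations across CPU cores; a minimal sketch with a placeholder gzip step:
find data/ -name "*.csv" | parallel gzip {}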
Integrating scripts with Python
Combine the power of shell scripting and Python for seamless workflows. For example:
#!/bin/bash
# Calling Python from shell script
python <<EOF
import os
print("Running Python code from shell script")
EOF
This enables you to leverage Python's rich ecosystem within a Linux-based automation pipeline.
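Because the heredoc delimiter above is unquoted, shell variables are expanded before Python sees the code, which is a convenient way to pass parameters in; a small sketch with a placeholder path:
#!/bin/bash
DATA_DIR=/data/raw
python <<EOF
import os
files = os.listdir("$DATA_DIR")   # "$DATA_DIR" is substituted by the shell before Python runs
print(len(files), "files found in", "$DATA_DIR")
EOF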
Command line utils for data scientists
Linux offers a treasure trove of command-line utilities that simplify data processing, performance monitoring, and file management. Let's explore some indispensable tools.
Data processing tools
- grep: Extract patterns using regular expressions. For instance, finding rows with specific keywords in a dataset:
grep -E "error|failure" logs.txt > filtered_logs.txt
- awk: A powerful tool for extracting and transforming data. To extract the second column of a CSV file:
awk -F, '{ print $2 }' dataset.csv > column2.txt
Perform transformations inline:
awk -F, '{ OFS=","; $3=$3*100; print }' dataset.csv > updated_dataset.csv
- sed: Edit files in place. Replace all instances of "old" with "new":
sed -i 's/old/new/g' file.txt
- sort and uniq: Deduplicate and sort data. For example, count unique values in a column:
awk -F, '{ print $1 }' dataset.csv | sort | uniq -c
Performance monitoring and process management
- top/htop: Monitor system resource usage. While top is minimalistic, htop offers an interactive interface with sortable columns and process management features.
- nmon: Provides in-depth system performance metrics, including CPU, memory, disk I/O, and network usage. Useful for debugging performance bottlenecks in training.
- ps: Lists running processes. To find a specific process:
ps aux | grep python
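To act on a process once you've found it, pgrep and kill pair well; a small sketch assuming a hypothetical train.py run:
pgrep -af train.py         # list matching PIDs with their command lines
kill $(pgrep -f train.py)  # send SIGTERM to them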
File compression and transfer
- tar and gzip: Compress datasets:
tar -czvf dataset.tar.gz datasets/
- scp: Transfer files securely:
scp dataset.tar.gz user@remote_host:/path/to/destination
- rsync: Synchronize directories efficiently:
rsync -avz datasets/ user@remote_host:/backup/datasets/
This tool is especially valuable for backing up or syncing large datasets across servers.
Example: combining utils for workflow automation
Imagine a scenario where you preprocess a dataset, compress it, and transfer it to a remote server. You can script the entire process:
#!/bin/bash
# Data processing pipeline
grep -E "pattern" raw_data.txt > filtered_data.txt
awk -F, '{ print $2 }' filtered_data.txt > column2.txt
sort column2.txt | uniq > unique_values.txt
tar -czvf processed_data.tar.gz unique_values.txt
scp processed_data.tar.gz user@remote_host:/data/
Mastering these tools elevates your ability to manage data science workflows efficiently in a Linux environment.
Remote workflows
Remote workflows are at the heart of many data scientists' and ML engineers' daily routines. Whether you're connecting to a high-performance compute cluster, managing a remote server for experiments, or syncing files across systems, SSH (Secure Shell) is an essential tool.
Basics of SSH
At its core, SSH provides a secure channel over an unsecured network. When you execute:
ssh user@remote-server
you're establishing an encrypted connection to the remote server under the specified username. The server's identity is verified via its host key, and your access is authenticated using a password or a private key.
Setting up and configuring keys
SSH keys streamline the authentication process by replacing passwords with cryptographic key pairs:
- Generate a key pair on your local machine:
ssh-keygen -t ed25519 -C "your_email@example.com"
This generates two files: ~/.ssh/id_ed25519 (private key) and ~/.ssh/id_ed25519.pub (public key).
- Copy the public key to the remote server:
ssh-copy-id user@remote-server
This appends your public key to ~/.ssh/authorized_keys on the server.
- Test the connection:
ssh user@remote-server
If successful, no password will be required.
Managing multiple SSH connections
When juggling multiple servers, managing connection details manually becomes tedious. Enter the ~/.ssh/config file:
Host myserver
HostName remote-server
User user
IdentityFile ~/.ssh/id_ed25519
Port 22
With this configuration, connecting is as simple as:
ssh myserver
This setup supports aliases, alternative ports, and different key files for various servers, greatly simplifying workflows.
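The same config file can describe several hosts, including ones reached through a jump host; a sketch with made-up hostnames (ProxyJump requires a reasonably recent OpenSSH):
Host gpu-box
    HostName 10.0.0.12
    User user
    IdentityFile ~/.ssh/id_ed25519
    ProxyJump myserver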
Port forwarding
Port forwarding enables you to securely access services running on a remote server. For instance, to run a Jupyter Notebook:
- Start the notebook on the remote server:
jupyter notebook --no-browser --port=8889
- Forward the port to your local machine:
ssh -L 8888:localhost:8889 user@remote-server
- Open http://localhost:8888 in your browser to access the notebook.
File transfers and syncing
Transferring files between systems is a common task. The scp command provides a straightforward method:
scp local_file user@remote-server:/path/to/destination
For more advanced use cases, like syncing directories, rsync is invaluable:
rsync -avh local_directory/ user@remote-server:/remote_directory/
This command preserves metadata and only transfers modified files, making it efficient for large datasets.
Tunneling and port forwarding for Jupyter notebooks
SSH tunneling securely forwards traffic between systems. To create a tunnel for JupyterLab:
- Open the tunnel:
ssh -L 8080:localhost:8888 user@remote-server
- Open your browser and navigate to http://localhost:8080.
This approach ensures your local machine securely communicates with the remote server without exposing sensitive ports.
Package management and environment setup
Linux's package managers provide a robust ecosystem for managing software, libraries, and dependencies. For ML and data science, this ensures consistency and efficiency.
Pacman and AUR
On Arch Linux and its derivatives, pacman is the default package manager. It handles system updates and installs with speed and simplicity:
sudo pacman -Syu # Synchronize and update system
The Arch User Repository (AUR) extends pacman with community-maintained packages, crucial for accessing niche libraries and tools.
Use pacman to search for packages:
pacman -Ss python
To install:
sudo pacman -S python
To remove unused packages:
sudo pacman -Rns package_name
To clean the package cache:
sudo paccache -r
Paru, an AUR helper
Tools like paru simplify managing AUR packages. To install paru itself:
git clone https://aur.archlinux.org/paru.git
cd paru
makepkg -si
Then, install AUR packages:
paru -S package_name
Setting up Python environments
Python's flexibility is both a blessing and a curse. Proper environment management prevents dependency conflicts:
- venv: Built into Python, it's ideal for lightweight projects:
python -m venv myenv
source myenv/bin/activate
pip install numpy pandas
- Conda: For larger, more complex environments:
conda create -n myenv python=3.9 numpy pandas
conda activate myenv
Conda handles non-Python dependencies, making it indispensable for ML workflows.
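For instance, system-level libraries can come from conda channels rather than the OS package manager; a sketch (exact package and channel names may vary):
conda install -c conda-forge cudatoolkit   # GPU runtime libraries kept inside the environment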
System-wide and user-specific installations
Use pip install --user to avoid system-wide conflicts:
pip install --user scikit-learn
For anything project-specific, prefer virtual environments so dependencies stay isolated from system-wide installations.
GPU setup for deep learning
Verifying GPU drivers and CUDA setup is crucial:
- Check GPU availability:
nvidia-smi
- Install drivers:
sudo pacman -S nvidia
- Install CUDA:
sudo pacman -S cuda
- Verify CUDA installation:
nvcc --version
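As a final sanity check, confirm that your deep learning framework actually sees the GPU; a quick sketch assuming PyTorch is installed:
python -c "import torch; print(torch.cuda.is_available())"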
With these tools configured, your Linux system is primed for efficient data science and machine learning workflows.
Filesystem permissions and security
Working with filesystem permissions in Linux is essential for managing data securely and effectively. Permissions are foundational for protecting sensitive data and ensuring that processes and users operate within the intended scope. Linux's permission model revolves around three entities: the user, the group, and others. Each file and directory has a permission set that dictates what each entity can read (r), write (w), and execute (x).
Controlling file permissions
The chmod command is the go-to tool for changing permissions. Permissions are represented either symbolically (e.g., rwx) or numerically (e.g., 755). The numeric representation uses octal notation:
- 4: Read
- 2: Write
- 1: Execute
Combine these values to set permissions. For example, chmod 750 my_script.sh grants full permissions to the owner (7 = 4+2+1), read and execute to the group (5 = 4+1), and no permissions to others (0).
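You can verify the result with ls -l, reusing the same example file name (assuming my_script.sh exists in the current directory):
chmod 750 my_script.sh
ls -l my_script.sh   # shows -rwxr-x--- 1 owner group ... my_script.sh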
Symbolic modifications provide more granular control:
chmod u+x my_script.sh # Adds execute permission to the user.
chmod g-w my_script.sh # Removes write permission from the group.
chmod o=r my_script.sh # Sets read-only for others.
Changing ownership
Ownership in Linux defines who controls a file. The chown command allows you to change the user and/or group owner:
chown alice:developers my_data.csv
Here, Alice becomes the owner, and the developers group is assigned. Changing ownership can be extended recursively with the -R flag to affect entire directories:
chown -R bob:researchers /project
Default permission mask
The umask command sets default permissions for newly created files and directories. A common default umask value is 022, which removes write permissions for group and others:
umask 022
Files are created with a default mode of 666 (read/write for all) minus the umask. Directories default to 777 minus the umask. To ensure new files are private:
umask 077
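A quick way to see the arithmetic in action (the file name is a placeholder):
umask 022
touch example.txt
ls -l example.txt   # -rw-r--r--  (666 minus 022 gives 644)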
Managing user groups and privileges
Linux groups allow you to manage access rights for multiple users collectively.
- groupadd: Creates new groups:
groupadd analysts
- usermod: Adds users to groups or modifies existing ones:
usermod -aG analysts charlie # Adds Charlie to the analysts group.
List user groups with:
groups charlie
Using sudo effectively
The sudo command allows users to execute commands with elevated privileges. Best practices include:
- Minimal permissions: Grant only the required privileges by editing the /etc/sudoers file or using visudo for safety.
- Avoid direct root login: This ensures better auditing and reduces risks.
- Time-limited sudo: Some organizations use sudo configurations that require re-authentication after a timeout.
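For instance, a sudoers entry (always edited through visudo) can restrict a user to a single command; a sketch with a hypothetical user:
# /etc/sudoers.d/charlie: allow charlie to run pacman and nothing else
charlie ALL=(ALL) /usr/bin/pacman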
Protecting sensitive files
Sensitive files, such as API keys or private datasets, can be encrypted using gpg (GNU Privacy Guard). Encryption ensures that even if someone gains access to the file, they cannot read it without the decryption key.
Encrypt a file:
gpg -c secrets.txt # Prompts for a passphrase and encrypts the file.
Decrypt a file:
gpg secrets.txt.gpg
Advanced users can use asymmetric encryption for sharing secrets securely:
gpg --encrypt --recipient alice@example.com shared_data.csv
Troubleshooting and system maintenance
When things go wrong on a Linux system — and they inevitably will — knowing how to troubleshoot effectively can save hours or days of frustration. For ML workloads, this can mean recovering lost compute time or diagnosing performance bottlenecks.
Interpreting Linux logs
Linux logs are invaluable for debugging issues. The journalctl command interacts with the systemd journal and provides an intuitive way to view logs:
journalctl -u apache2.service # Logs for a specific service.
journalctl --since "2 hours ago" # Logs from the past 2 hours.
Older log files are stored in /var/log; note that on a systemd-centric setup like Manjaro, some of these traditional files only appear if a syslog daemon is installed, otherwise the journal is the place to look. Key logs include:
- /var/log/syslog: General system messages.
- /var/log/auth.log: Authentication and sudo-related logs.
- /var/log/kern.log: Kernel-related messages.
Recovering from boot errors
Boot errors can cripple productivity. Using a live USB or recovery disk, you can mount the root filesystem and chroot into it:
mount /dev/sda1 /mnt
mount --bind /dev /mnt/dev
chroot /mnt
From here, you can repair configurations, reinstall packages, or reconfigure the bootloader.
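For example, if the bootloader itself is broken, it can be reinstalled from inside the chroot; a sketch assuming a BIOS system with GRUB on /dev/sda (UEFI setups differ):
grub-install /dev/sda
grub-mkconfig -o /boot/grub/grub.cfg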
Monitoring disk health and performance
Monitoring disk health is critical for ensuring data integrity, especially when handling large datasets. Use smartctl from the smartmontools package:
smartctl -a /dev/sda # View SMART health information.
Disk performance can be analyzed using iostat:
iostat -x 1
This provides metrics like disk utilization, read/write throughput, and queue lengths.
Managing out-of-memory errors
Out-of-memory (OOM) errors are common when running large ML workloads. The Linux kernel's OOM-killer terminates processes when memory runs low. To avoid this:
- Monitor memory usage: Tools like htop and free provide real-time memory stats.
- Create swap files: Swap acts as overflow memory when RAM is full:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
- Tune OOM behavior: Adjust a process's oom_score_adj so that critical processes are less likely to be killed:
echo -1000 > /proc/$(pidof my_ml_process)/oom_score_adj
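Note that the redirection in the line above is performed by your own (usually unprivileged) shell, so in practice you'll need root for it; a small sketch using sudo tee, with my_ml_process as a placeholder name:
echo -1000 | sudo tee /proc/$(pidof my_ml_process)/oom_score_adj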