7. Python Basics and Best Practices

Python is the primary language for scientific computing in the CASA group. This guide covers the fundamentals of Python coding and package management with Conda, and introduces libraries commonly used in atmospheric science.

7.1. Python Basics

Running Python

You can run Python in several ways:

python                          # Interactive Python shell
python script.py                # Run a Python file
python -c "print('hello')"      # Execute Python code directly
ipython                         # Enhanced interactive shell (if installed)
jupyter notebook                # Jupyter notebook environment

Basic Syntax

Variables and assignments:

name = "Alice"                  # String
age = 25                        # Integer
temperature = 23.5              # Float
is_valid = True                 # Boolean
items = [1, 2, 3]              # List
coords = (10, 20)              # Tuple
data = {"name": "Alice", "age": 25}  # Dictionary

Print output:

print("Hello, world!")
print(f"Name: {name}, Age: {age}")  # f-strings (modern approach)
print("Name: {}, Age: {}".format(name, age))  # Older approach

Control Flow

If statements:

if temperature > 30:
    print("Hot!")
elif temperature > 20:
    print("Warm")
else:
    print("Cold")

Loops:

# For loop
for i in range(5):
    print(i)  # Prints 0, 1, 2, 3, 4

# For loop over list
for item in items:
    print(item)

# While loop
count = 0
while count < 5:
    print(count)
    count += 1

# List comprehension (powerful!)
squares = [x**2 for x in range(5)]  # [0, 1, 4, 9, 16]
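
Comprehensions can also filter items and build dictionaries. A small sketch, using made-up temperature readings (the variable names are illustrative only):

```python
# Hypothetical daily temperatures in degrees Celsius
readings = [15.2, 31.4, 22.8, 9.5, 27.1]

# Keep only the values above 20 (a comprehension with a condition)
warm = [t for t in readings if t > 20]

# Build a dictionary mapping each reading to a label
labels = {t: ("warm" if t > 20 else "cool") for t in readings}

print(warm)          # [31.4, 22.8, 27.1]
print(labels[9.5])   # cool
```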

Functions

Define and use functions:

def greet(name):
    """Greet someone by name."""
    return f"Hello, {name}!"

result = greet("Alice")
print(result)  # Hello, Alice!

# Functions with default arguments
def add(a, b=0):
    """Add two numbers."""
    return a + b

print(add(5))        # 5 (b defaults to 0)
print(add(5, 3))     # 8

# Functions with multiple return values
def get_min_max(numbers):
    return min(numbers), max(numbers)

min_val, max_val = get_min_max([1, 5, 3, 9, 2])

Working with Lists and Dictionaries

Lists:

fruits = ["apple", "banana", "cherry"]

# Access by index
print(fruits[0])            # apple
print(fruits[-1])           # cherry (last item)

# Slicing
print(fruits[0:2])          # ["apple", "banana"]
print(fruits[1:])           # ["banana", "cherry"]

# Methods
fruits.append("date")       # Add item
fruits.remove("banana")     # Remove specific item
length = len(fruits)        # Get length
fruits.sort()               # Sort in place

Dictionaries:

person = {"name": "Alice", "age": 25, "city": "Cardiff"}

# Access by key
print(person["name"])       # Alice

# Add or modify
person["age"] = 26
person["email"] = "alice@example.com"

# Methods
keys = person.keys()        # Get all keys
values = person.values()    # Get all values
for key, value in person.items():
    print(f"{key}: {value}")
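
Accessing a missing key with square brackets raises a KeyError. The sketch below shows two common ways to guard against that, using a cut-down version of the person dictionary (the fallback value "unknown" is an arbitrary choice):

```python
person = {"name": "Alice", "age": 25}

# person["email"] would raise KeyError because the key is absent.
email = person.get("email")            # None when the key is missing
city = person.get("city", "unknown")   # Fallback value of our choosing

# Membership test before access
has_age = "age" in person

print(email, city, has_age)   # None unknown True
```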

7.2. Managing Python Environments with Conda

What is Conda?

Conda is a package and environment manager. It lets you:

- Install Python packages easily
- Create isolated environments for different projects
- Manage package versions to avoid conflicts
- Share environments with colleagues

Installing Conda

Conda ships with both Anaconda and Miniconda. On Falcon, load it as a module:

module load anaconda3           # Load Conda on Falcon
conda --version                 # Check installation

Creating Environments

Create a new environment:

conda create -n myproject python=3.9         # Python 3.9
conda create -n myproject python=3.10        # ...or Python 3.10 (pick one; the name must not already exist)

Specify packages when creating:

conda create -n myproject python=3.9 numpy pandas matplotlib

Activating/Deactivating Environments

Activate an environment:

conda activate myproject

You’ll see (myproject) in your terminal prompt when it’s active.

Deactivate:

conda deactivate
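
To check from inside Python which environment is in use, the standard library exposes the interpreter path and prefix. A minimal sketch (the helper name describe_interpreter is made up for this guide; CONDA_DEFAULT_ENV is the variable conda sets on activation, and it is absent otherwise):

```python
import os
import sys

def describe_interpreter():
    """Return basic facts about the Python interpreter that is running."""
    return {
        "executable": sys.executable,        # Full path to the python binary
        "prefix": sys.prefix,                # Root directory of the environment
        "version": sys.version.split()[0],   # Version string, e.g. "3.9.18"
        # None if no conda environment has been activated in this shell
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }

info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the prefix does not point inside the environment you expected, the script is probably running with a different Python than the one you activated.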

Installing and Updating Packages

Install a package:

conda install numpy              # Latest version
conda install numpy=1.21.0       # Specific version
conda install numpy pandas scipy # Multiple packages

Update a package:

conda update numpy

Remove a package:

conda remove numpy

Listing Packages

See installed packages:

conda list                       # All packages
conda list | grep numpy          # Search for a specific package

Sharing Environments

Export your environment to a file:

conda env export > environment.yml

Someone else can recreate it:

conda env create -f environment.yml

Or create from a minimal YAML file:

# Create a file called environment.yml
name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas
  - matplotlib
  - xarray
  - netcdf4

Then create it:

conda env create -f environment.yml

Useful Conda Commands

conda info                      # Show conda configuration
conda env list                  # List all environments
conda env remove -n myproject   # Delete an environment
conda search numpy              # Search for package versions
conda clean --all              # Clean cache (frees disk space)

7.3. Common Libraries

NumPy — Numerical Computing

NumPy provides arrays and mathematical functions. Most scientific Python builds on NumPy.

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 3))           # 3x3 array of zeros
ones = np.ones((2, 4))             # 2x4 array of ones
range_arr = np.arange(0, 10, 2)   # [0 2 4 6 8]

# Array operations
arr2d = np.array([[1, 2], [3, 4], [5, 6]])
print(arr2d.shape)                  # (3, 2)
print(arr2d[0, 1])                  # 2 (element at row 0, col 1)

# Math operations (element-wise)
result = arr * 2                    # [2 4 6 8 10]
result = np.sin(arr)
result = np.exp(arr)

# Useful functions
mean = np.mean(arr)
std = np.std(arr)
max_val = np.max(arr)
min_val = np.min(arr)

# Reshaping
reshaped = arr.reshape(5, 1)
flattened = arr2d.flatten()
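
Element-wise operations also work between arrays of different shapes through broadcasting. The sketch below removes a per-column mean from a 2D array; the data are invented (3 days by 2 stations) purely for illustration:

```python
import numpy as np

# Hypothetical temperatures: 3 days (rows) x 2 stations (columns)
temps = np.array([[15.0, 10.0],
                  [17.0, 12.0],
                  [19.0, 14.0]])

# Mean over axis 0 (days) gives one value per station, shape (2,)
station_mean = temps.mean(axis=0)

# Broadcasting stretches the (2,) array across all 3 rows
anomalies = temps - station_mean

print(station_mean)   # [17. 12.]
print(anomalies)
```

No explicit loop is needed: NumPy lines up the trailing dimensions and repeats the smaller array as required.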

Pandas — Tabular Data

Pandas handles table-like data (like spreadsheets or databases) and is essential for climate data analysis.

import pandas as pd

# Create a DataFrame (table)
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'temperature': [15.2, 16.1, 14.8],
    'humidity': [65, 70, 68]
}
df = pd.DataFrame(data)

# Access columns
temps = df['temperature']
print(temps[0])                     # 15.2

# Access rows
row = df.iloc[0]                    # First row as Series
row = df.loc[0]                     # Access by label

# Filtering
hot_days = df[df['temperature'] > 15]

# Statistics
print(df['temperature'].mean())      # 15.366...
print(df['temperature'].std())
print(df.describe())                # Summary statistics

# Read from file
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')     # .xlsx support needs the openpyxl package

# Write to file
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx')          # Also needs openpyxl

# Basic operations
df['new_column'] = df['temperature'] * 1.8 + 32  # Fahrenheit
df = df.drop('humidity', axis=1)    # Drop a column (returns a new DataFrame)
df = df.dropna()                    # Remove rows with missing values (returns a new DataFrame)
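
A common next step with dated observations is to average by calendar month. The sketch below uses a small invented table; the column names date and temperature are assumptions chosen to match the example above:

```python
import pandas as pd

# Hypothetical daily observations spanning two months
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-02-01", "2024-02-02"],
    "temperature": [4.0, 6.0, 8.0, 10.0],
})

# Convert strings to real datetimes so the .dt accessor is available
df["date"] = pd.to_datetime(df["date"])

# Group rows by calendar month and average each group
monthly_mean = df.groupby(df["date"].dt.month)["temperature"].mean()

print(monthly_mean)
```

The result is a Series indexed by month number, so monthly_mean.loc[1] is the January mean.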

Matplotlib — Plotting

Create visualizations:

import matplotlib.pyplot as plt
import numpy as np

# Simple plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y, label='sin(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Sine Wave')
plt.legend()
plt.show()

# Multiple plots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title('sin(x)')

axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title('cos(x)')

axes[1, 0].scatter(x[:20], y[:20])
axes[1, 0].set_title('scatter')

axes[1, 1].hist(np.random.normal(0, 1, 1000))
axes[1, 1].set_title('histogram')

plt.tight_layout()
plt.savefig('plots.png', dpi=300)
plt.show()

xarray — Labeled Multidimensional Data

Xarray is built on top of NumPy and designed for climate/weather data with named dimensions.

import xarray as xr
import numpy as np

# Create a DataArray (labeled array)
data = np.random.randn(10, 20)
da = xr.DataArray(
    data,
    dims=['lat', 'lon'],
    coords={'lat': np.arange(10), 'lon': np.arange(20)},
    name='temperature'
)

# Access by label (not just index!)
print(da.sel(lat=5, lon=10).values)  # Get value at lat=5, lon=10

# Slice by dimension
subset = da.sel(lat=slice(2, 7), lon=slice(5, 15))

# Read from NetCDF file (common format for climate data)
ds = xr.open_dataset('temperature.nc')
print(ds)                           # Show dataset info
print(ds.data_vars)                 # Show all variables
temps = ds['temperature']           # Get a variable

# Aggregate along a dimension
mean_by_lat = da.mean(dim='lon')   # Average over longitude

# Save to NetCDF
da.to_netcdf('output.nc')

SciPy — Scientific Functions

Advanced mathematical functions:

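import numpy as np
import scipy.stats as stats
from scipy.interpolate import interp1d
from scipy.optimize import minimize

# Statistics: two-sample t-test on example data
group1 = [15.2, 16.1, 14.8, 15.9]
group2 = [17.0, 16.5, 17.8, 16.9]
t_stat, p_value = stats.ttest_ind(group1, group2)  # Returns the statistic and the p-value

# Optimization
def objective(x):
    return (x[0] - 3)**2    # minimize passes x as a 1-D array

result = minimize(objective, x0=0)
print(result.x)    # Location of the minimum (close to 3)
print(result.fun)  # Value of the objective at the minimum

# Interpolation on example data
x_data = np.arange(5)
y_data = x_data**2
f = interp1d(x_data, y_data, kind='cubic')
x_new = np.linspace(0, 4, 9)   # Must stay within the range of x_data
y_interp = f(x_new)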

7.4. Code Organization

Scripts vs. Functions

Bad: Everything at the top level of the script

# analysis.py
data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)
print(mean)

Good: Organize into functions

# analysis.py
def calculate_mean(data):
    """Calculate the mean of data."""
    return sum(data) / len(data)

def main():
    data = [1, 2, 3, 4, 5]
    mean = calculate_mean(data)
    print(mean)

if __name__ == "__main__":
    main()

Module Organization

For larger projects, organize code:

myproject/
├── data/
│   ├── raw/
│   └── processed/
├── src/
│   ├── __init__.py
│   ├── preprocessing.py
│   ├── analysis.py
│   └── plotting.py
├── scripts/
│   ├── process_data.py
│   └── run_analysis.py
└── environment.yml

Importing from Modules

In src/preprocessing.py:

def load_data(filename):
    """Load data from file."""
    import pandas as pd
    return pd.read_csv(filename)

def clean_data(df):
    """Remove missing values."""
    return df.dropna()

In scripts/process_data.py:

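import sys
from pathlib import Path

# Resolve paths from this file's location so the script works from any directory
project_root = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(project_root / 'src'))

from preprocessing import load_data, clean_data

df = load_data(project_root / 'data' / 'raw' / 'input.csv')
df_clean = clean_data(df)
df_clean.to_csv(project_root / 'data' / 'processed' / 'output.csv', index=False)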

7.5. Best Practices

1. Use Virtual Environments

Never code in the base Python environment. Always use conda environments:

conda create -n myproject python=3.9
conda activate myproject
conda install numpy pandas matplotlib

2. Use Meaningful Variable Names

Bad:

x = [1, 2, 3]
y = sum(x) / len(x)

Good:

temperatures = [15.2, 16.1, 14.8]
mean_temp = sum(temperatures) / len(temperatures)

3. Write Docstrings

def calculate_anomaly(data, climatology):
    """
    Calculate anomalies from climatology.

    Parameters
    ----------
    data : array-like
        Data values
    climatology : float
        Climatological mean

    Returns
    -------
    anomaly : array-like
        Anomalies relative to climatology
    """
    return data - climatology
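
Docstrings are more than comments: Python stores them on the function object, and tools such as help() read them. A brief sketch with a shortened version of the function above:

```python
import inspect

def calculate_anomaly(data, climatology):
    """Return data minus the climatological mean."""
    return data - climatology

# inspect.getdoc returns the docstring with indentation cleaned up
doc = inspect.getdoc(calculate_anomaly)
print(doc)

# help(calculate_anomaly) shows the same text in an interactive session
```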

4. Handle Errors

import pandas as pd

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found!")
except Exception as e:
    print(f"Unexpected error: {e}")
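
Your own functions can also raise errors with clear messages instead of failing somewhere less obvious. A minimal sketch, assuming a calculate_mean helper like the one in section 7.4 (the error message wording is our own):

```python
def calculate_mean(values):
    """Return the arithmetic mean, rejecting empty input."""
    if len(values) == 0:
        # Fail early with a clear message instead of ZeroDivisionError
        raise ValueError("calculate_mean() needs at least one value")
    return sum(values) / len(values)

try:
    calculate_mean([])
except ValueError as err:
    message = str(err)
    print(f"Error: {message}")

print(calculate_mean([15.2, 16.1, 14.8]))
```

Catching the specific exception type (ValueError here) keeps unrelated bugs from being silently swallowed.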

5. Use Type Hints (Python 3.5+; list[int] and dict[str, float] need Python 3.9+)

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def process_data(data: list[int]) -> dict[str, float]:
    """Process a list of integers."""
    return {
        'mean': sum(data) / len(data),
        'max': max(data),
        'min': min(data)
    }

6. Test Your Code

# test_analysis.py
from analysis import calculate_mean

def test_calculate_mean():
    assert calculate_mean([1, 2, 3, 4, 5]) == 3.0
    assert calculate_mean([10]) == 10.0
    print("All tests passed!")

if __name__ == "__main__":
    test_calculate_mean()

Run with:

python test_analysis.py
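
Exact equality works for the whole numbers above, but results computed from decimal fractions can differ in the last digits. One possible approach is to compare with a tolerance using math.isclose from the standard library; the test name and tolerance below are illustrative choices:

```python
import math

def calculate_mean(data):
    """Calculate the mean of data."""
    return sum(data) / len(data)

def test_mean_of_floats():
    result = calculate_mean([0.1, 0.2, 0.3])
    # Rounding error may keep result from equalling 0.2 exactly,
    # so compare within a relative tolerance instead of using ==
    assert math.isclose(result, 0.2, rel_tol=1e-9)

test_mean_of_floats()
print("Float test passed!")
```

If pytest is installed in your environment, it can discover and run functions named test_* without the manual call at the bottom.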

7.6. Common Workflows

Workflow 1: Data Processing

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('temperature_data.csv')

# Inspect
print(df.head())
df.info()                   # Prints its summary directly; wrapping it in print() adds a stray "None"
print(df.describe())

# Clean
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])

# Process
df['anomaly'] = df['temperature'] - df['temperature'].mean()

# Filter
winter = df[df['date'].dt.month.isin([12, 1, 2])]  # Derive the month from the parsed date column

# Save
winter.to_csv('winter_data.csv', index=False)

Workflow 2: Data Analysis and Plotting

import xarray as xr
import matplotlib.pyplot as plt

# Load NetCDF file
ds = xr.open_dataset('temperature.nc')

# Extract variable
temp = ds['temperature']

# Compute statistics
mean_temp = temp.mean(dim=['lat', 'lon'])
std_temp = temp.std(dim=['lat', 'lon'])

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

mean_temp.plot(ax=axes[0])
axes[0].set_title('Mean Temperature')

std_temp.plot(ax=axes[1])
axes[1].set_title('Std Dev Temperature')

plt.tight_layout()
plt.savefig('temperature_stats.png', dpi=300)

Workflow 3: Batch Processing

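from pathlib import Path
import pandas as pd

raw_dir = Path('data/raw')
processed_dir = Path('data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)  # Make sure the output folder exists

# Process multiple files
for filepath in sorted(raw_dir.glob('*.csv')):
    print(f"Processing {filepath}")

    df = pd.read_csv(filepath)
    df = df.dropna()
    df['processed'] = df['value'] * 2

    # Same file name, different folder (avoids rewriting 'raw' inside file names)
    outfile = processed_dir / filepath.name
    df.to_csv(outfile, index=False)

print("Done!")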

7.7. Debugging Tips

Using Print Statements

x = 10
print(f"DEBUG: x = {x}, type = {type(x)}")
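
For longer scripts, the standard logging module is an alternative to scattered print calls, because messages can be switched off by level rather than deleted. A minimal sketch (the logger name "analysis" and the message format are arbitrary choices):

```python
import logging

# Configure once, near the start of the script
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
logger = logging.getLogger("analysis")   # "analysis" is an arbitrary name

x = 10
logger.debug("x = %s, type = %s", x, type(x))   # Shown only at DEBUG level
logger.info("Finished step 1")
```

Raising the level to logging.INFO hides the debug line without touching the rest of the code.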

Using Python Debugger (pdb)

import pdb

def problematic_function(x):
    pdb.set_trace()  # Execution pauses here
    result = x * 2
    return result

# In the debugger:
# n  - next line
# s  - step into function
# c  - continue
# p var - print variable
# l  - list code
# q  - quit

Or run a whole script under the debugger from the command line:

python -m pdb script.py

Using VS Code Debugger

With Remote-SSH and the Python extension:

1. Set breakpoints by clicking line numbers
2. Press F5 to start debugging
3. Step through code with F10/F11
4. Inspect variables in the sidebar

7.8. Need Help?

Python docs: https://docs.python.org/3/
NumPy docs: https://numpy.org/doc/
Pandas docs: https://pandas.pydata.org/docs/
Matplotlib docs: https://matplotlib.org/stable/contents.html
xarray docs: https://docs.xarray.dev/
Stack Overflow: Search your error message
Ask Omar or colleagues: They’ve likely encountered similar issues
Read existing code: Look at scripts from groupmates for examples