Data analysis with python

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is SciPy?

  • How can I use Scipy?

Objectives
  • Learn about SciPy for spatio-temporal data analysis

  • Using SciPy clustering algorithms on spatio-temporal data

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. In particular, these are some of the core packages:

We won’t cover the usage of all these packages and will only give a few examples that are meaningful when working with spatio-temporal data.

You know some of these packages, for instance NumPy and Matplotlib; we have used them in previous chapters.

We also use partly IPython when using jupyter notebooks. pandas is one of the best options for working with tabular data in Python. The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy arrays. If you want to know more about pandas, have a look at the following tutorials/Carpentry lessons:

The usage of nose and Sympy are both outside the scope of this lesson.

SciPy

K-Means Clustering of a Satellite Images using Scipy

This part is taken from the excellent blog of Max Köning.

K-means is a widely used method in cluster analysis. However, this method is valid only if a number of assumptions is valid with your dataset:

If any one of these 3 assumptions are violated, then k-means will not be correct.

On major decision you have to take when using K-means is to choose the number of clusters a priori. However, as we will see below, this choice is critical and has a strong influence on the results:

Let’s take the following example where we apply K-means for different number of clusters on a netCDF file containing Monthly Average Precipitable Water over Ice-Free Oceans (kg m-2) October 2009:

import netCDF4
import numpy as np
from scipy.cluster.vq import *
from matplotlib import colors as c
import matplotlib.pyplot as plt
%matplotlib inline

f = netCDF4.Dataset('tpw_v07r01_200910.nc4.nc', 'r')
lats = f.variables['latitude'][:]
lons = f.variables['longitude'][:]
pw = f.variables['precipitable_water'][0,:,:]

f.close()
# Flatten image to get line of values
flatraster = pw.flatten()
flatraster.mask = False
flatraster = flatraster.data

# Create figure to receive results
fig = plt.figure(figsize=[20,7])
fig.suptitle('K-Means Clustering')

# In first subplot add original image
ax = plt.subplot(241)
ax.axis('off')
ax.set_title('Original Image\nMonthly Average Precipitable Water\n over Ice-Free Oceans (kg m-2)')
original=ax.imshow(pw, cmap='rainbow', interpolation='nearest', aspect='auto', origin='lower')
plt.colorbar(original, cmap='rainbow', ax=ax, orientation='vertical')
# In remaining subplots add k-means clustered images
# Define colormap
list_colors=['blue','orange', 'green', 'magenta', 'cyan', 'gray', 'red', 'yellow']
for i in range(7):
    print("Calculate k-means with ", i+2, " clusters.")
    
    #This scipy code clusters k-mean, code has same length as flattened
    # raster and defines which cluster the value corresponds to
    centroids, variance = kmeans(flatraster.astype(float), i+2)
    code, distance = vq(flatraster, centroids)
    
    #Since code contains the clustered values, reshape into SAR dimensions
    codeim = code.reshape(pw.shape[0], pw.shape[1])
    
    #Plot the subplot with (i+2)th k-means
    ax = plt.subplot(2,4,i+2)
    ax.axis('off')
    xlabel = str(i+2) , ' clusters'
    ax.set_title(xlabel)
    bounds=range(0,i+2)
    cmap = c.ListedColormap(list_colors[0:i+2])
    kmp=ax.imshow(codeim, interpolation='nearest', aspect='auto', cmap=cmap,  origin='lower')
    plt.colorbar(kmp, cmap=cmap,  ticks=bounds, ax=ax, orientation='vertical')
plt.show()

The netCDF file we used can be freely downloaded at here.

clustering K-means

K-means limitations

See here what can happen if you do not choose the “right” number of clusters or when your data do not K-means assumptions.

import netCDF4
import numpy as np
from scipy.cluster.vq import *
from matplotlib import colors as c
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed((1000,2000))

f = netCDF4.Dataset('tpw_v07r01_200910.nc4.nc', 'r')
lats = f.variables['latitude'][:]
lons = f.variables['longitude'][:]
pw = f.variables['precipitable_water'][0,:,:]

f.close()
# Flatten image to get line of values
flatraster = pw.flatten()
flatraster.mask = False
flatraster = flatraster.data


# In first subplot add original image
fig, (ax1, ax2, ax3)  = plt.subplots(3, sharex=True)

# Create figure to receive results
fig.set_figheight(20)
fig.set_figwidth(15)

fig.suptitle('K-Means Clustering')
ax1.axis('off')
ax1.set_title('Original Image\nMonthly Average Precipitable Water\n over Ice-Free Oceans (kg m-2)')
original=ax1.imshow(pw, cmap='rainbow', interpolation='nearest', aspect='auto', origin='lower')
plt.colorbar(original, cmap='rainbow', ax=ax1, orientation='vertical')
# In remaining subplots add k-means clustered images
# Define colormap
list_colors=['blue','orange', 'green', 'magenta', 'cyan', 'gray', 'red', 'yellow']

print("Calculate k-means with 6 clusters.")
    
#This scipy code classifies k-mean, code has same length as flattened
# raster and defines which cluster the value corresponds to
centroids, variance = kmeans(flatraster.astype(float), 6)
code, distance = vq(flatraster, centroids)
    
#Since code contains the clustered values, reshape into SAR dimensions
codeim = code.reshape(pw.shape[0], pw.shape[1])
    
#Plot the subplot with 4th k-means
ax2.axis('off')
xlabel = '6 clusters'
ax2.set_title(xlabel)
bounds=range(0,6)
cmap = c.ListedColormap(list_colors[0:6])
kmp=ax2.imshow(codeim, interpolation='nearest', aspect='auto', cmap=cmap,  origin='lower')
plt.colorbar(kmp, cmap=cmap,  ticks=bounds, ax=ax2, orientation='vertical')

#####################################

thresholded = np.zeros(codeim.shape)
thresholded[codeim==3]=1
thresholded[codeim==5]=2

#Plot only values == 5
ax3.axis('off')
xlabel = 'Keep the fifth cluster only'
ax3.set_title(xlabel)
bounds=range(0,2)
cmap = c.ListedColormap(['white', 'green', 'cyan'])
kmp=ax3.imshow(thresholded, interpolation='nearest', aspect='auto', cmap=cmap,  origin='lower')
plt.colorbar(kmp, cmap=cmap,  ticks=bounds, ax=ax3, orientation='vertical')

plt.show()

clustering K-means

%matplotlib inline
from scipy.spatial import distance
import numpy as np
import matplotlib.pyplot as plt
from skimage import measure


# Find contours at a constant value of 2.0
contours = measure.find_contours(thresholded, 1.0)

# Display the image and plot all contours found

fig = plt.figure(figsize=[20,7])
ax = plt.subplot()
ax.set_title('Original Image\nMonthly Average Precipitable Water\n over Ice-Free Oceans (kg m-2)')
original=ax.imshow(pw, cmap='rainbow', interpolation='nearest', aspect='auto', origin='lower')
plt.colorbar(original, cmap='rainbow', ax=ax, orientation='vertical')
for n, contour in enumerate(contours):
    dists = distance.cdist(contour, contour, 'euclidean')
    if dists.max() > 200:
        ax.plot(contour[:, 1], contour[:, 0], linewidth=2, color='black')
        print(dists.max())

clustering K-means

Save contours in a shapefile

To save the resulting contours, we need to get the coordinates of each point of the contour and create a polygon. We start from the preceding code and add the computation of the coordinates in lat/lon and store them in a shapefile, using shapely and geopandas python packages:

%matplotlib inline
from scipy.spatial import distance
import numpy as np
import matplotlib.pyplot as plt
from skimage import measure
import geopandas as gpd
from fiona.crs import from_epsg
from shapely import geometry


# Find contours at a constant value of 2.0
contours = measure.find_contours(thresholded, 1.0)

# Create an empty geopandas GeoDataFrame
newdata = gpd.GeoDataFrame()
# Create a new column called 'geometry' to the GeoDataFrame
newdata['geometry'] = None
# Set the GeoDataFrame's coordinate system to WGS84
newdata.crs = from_epsg(4326)

# Display the image and plot all contours found

fig = plt.figure(figsize=[20,7])
ax = plt.subplot()
ax.set_title('Original Image\nMonthly Average Precipitable Water\n over Ice-Free Oceans (kg m-2)')
original=ax.imshow(pw, cmap='rainbow', interpolation='nearest', aspect='auto', origin='lower')
plt.colorbar(original, cmap='rainbow', ax=ax, orientation='vertical')

ncontour=0

for n, contour in enumerate(contours):
    dists = distance.cdist(contour, contour, 'euclidean')
    if dists.max() > 200:
        ax.plot(contour[:, 1], contour[:, 0], linewidth=2, color='black')
        # Approximate latitudes/longitudes of the contour
        coords = []
        for c in contour:
            if int(c[0]) == c[0]:
                lat = lats[int(c[ 0])]
            else:
                lat = (lats[int(c[ 0])] + lats[int(c[0])+1])/2.0
            
            if int(c[1]) == c[1]:
                lon = lons[int(c[ 1])]
            else:
                lon = (lons[int(c[ 1])] + lons[int(c[1])+1])/2.0
            
            coords.append([lon,lat])
        
        poly = geometry.Polygon([[c[0], c[1]] for c in coords])
        newdata.loc[ncontour, 'geometry'] = poly
        
        newdata.loc[ncontour,'idx_name'] = 'contour_' + str(ncontour)
        print(ncontour)
        print(dists.max())
        ncontour += 1
        
# Write the data into that Shapefile
newdata.to_file("contour_test.shp")

Then we can plot it:

%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=[12,15])  # a new figure window
ax = fig.add_subplot(1, 1, 1)  # specify (nrows, ncols, axnum)

map = Basemap(projection='cyl',llcrnrlat=-90,urcrnrlat=90,\
            llcrnrlon=0,urcrnrlon=360,resolution='c')

map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='#ffe2ab',lake_color='aqua')
map.drawcoastlines()

contour_test= map.readshapefile('contour_test', 'contour_test', linewidth=2.0, color="red")

raster concept

Key Points

  • scipy library