Data Formats in Environmental Sciences
Overview
Teaching: 0 min
Exercises: 0 minQuestions
What are the most common data formats in Environmental Sciences?
raster vs vector formats?
What are the most common python packages to read/write netCDF, HDF, GeoTIFF data files?
Objectives
Learn to recognize the some of the most common data formats in the Environmental sciences.
Learn to open, read and write geoTIFF, netCDF and HDF files
-
In this lesson we will first review some of the most common spatial data formats used in environmental Sciences and how we categorize them as raster or vector format.
-
Then we will learn how to recognize raster formats such as GeoTIFF, netCDF and HDF data files.
-
Finally, we will see what are vector formats and in particular shapefiles.
This lesson is based on the Data Carpentry lesson Geospatial Data Formats.
Types of Spatial Data
Spatial data are represented in many different ways and are stored in different file formats. In this tutorial, we will focus on the two types of spatial data: raster and vector data.
raster data formats
Raster or “gridded” data are saved on a uniform grid and rendered on a map as pixels. Each pixel contains a value that represents an area on the Earth’s surface.
Source: National Ecological Observatory Network (NEON)
There are many different raster data file formats. In this lesson, we will focus on the
netCDF
NetCDF (network Common Data Form) is a:
- Self-Describing
- Portable
- Scalable
- Appendable
- Sharable
- Archivable
data format standard for exchanging scientific data.
Self-describing? | Portable? | Scalable? | Appendable? | Sharable? | Archivable? |
---|---|---|---|---|---|
Information describing the data contents of the file are embedded within the data file itself. This means that there is a header which describes the layout of the rest of the file, in particular the data arrays, as well as arbitrary file metadata in the form of name/value attributes. | A NetCDF file is machine-independent i.e. it can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.For instance, issues such as endianness being addressed in the software libraries. | A small subset of a large dataset may be accessed efficiently. | Data may be appended to a properly structured NetCDF file without copying the dataset or redefining its structure. | One writer and multiple readers may simultaneously access the same NetCDF file. | Access to all earlier forms of NetCDF data will be supported by current and future versions of the software. |
The netCDF project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR).
As all self-describing data formats, netCDF includes a standard API (Application program interface) and portable data access libraries in a variety of languages, including python. There are netCDF tools that can open and work with arbitrary netCDF files, using the embedded descriptions to interpret the data.
What does it really mean when we say that a netCDF file is self-describing? Let’s find out with an example “MITgcm_state.nc”. If you do not have downloaded the metos_python data for this lesson, please see the setup instructions).
In this example, we will be looking at the metadataof a netCDF file. For this we open the netCDF file with the netCDF4 python package you were asked to install.
import xarray as xr nc_f = 'tpw_v07r01_200910.nc4.nc' # Your filename nc_fid = xr.open_dataset(nc_f) print(nc_fid) nc_fid.close()
<xarray.Dataset>
Dimensions: (expver: 2, latitude: 721, longitude: 1440, time: 506)
Coordinates:
* longitude (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
* latitude (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
* expver (expver) int32 1 5
* time (time) datetime64[ns] 1979-01-01 1979-02-01 ... 2021-02-01
Data variables:
u10 (time, expver, latitude, longitude) float32 ...
Attributes:
Conventions: CF-1.6
history: 2021-03-09 15:11:30 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...
Check a netCDF file?
Most of the time, netCDF filename extensions are either .nc or .nc4. However, it is not a mandatory requirement so it is useful to learn to check if a file is a netCDF file:
$ file tpw_v07r01_200910.nc4.nc
tpw_v07r01_200910.nc4.nc: NetCDF Data Format data
file is a bash shell command and is not a netCDF utility. You can use it for any kind of files and it attempts torecognize its data format.
netCDF and python?
The most common low-level python packages to handle netcdf is called netcdf4 python package. The functionalities covered by this python package are close to those covered by Unidata netCDF library.
Start a new jupyter notebook and enter:
import netCDF4
Creating netCDF files
All of the following code examples assume that the netcdf4-python library has been imported:
import netCDF4
Then 4 following steps:
Create an empty netCDF4 dataset and store it on disk
foo = netCDF4.Dataset('foo.nc', 'w') foo.close()
Then you should find a new file called “foo.nc” in your working directory:
ls
foo.nc
The Dataset constructor defaults to creating NETCDF4 format objects. Other formats may be specified with the format keyword argument (see the netCDF4-python docs).
The first argument that Dataset takes is the path and name of the netCDF4 file that will be created, updated, or read. The second argument is the mode with which to access the file. Use:
- w (write mode) to create a new file, use clobber=True to over-write and existing one
- r (read mode) to open an existing file read-only
- r+ (append mode) to open an existing file and change its contents
Dimensions
Create dimensions on a dataset with the createDimension() method. We first create a new netCDF file but this time we don’t close it immediately and create 3 dimensions:
import netCDF4 f = netCDF4.Dataset('orography.nc', 'w') f.createDimension('time', None) f.createDimension('z', 3) f.createDimension('y', 4) f.createDimension('x', 5)
The first dimension is called time with unlimited size (i.e. variable values may be appended along the this dimension). Unlimited size dimensions must be declared before (“to the left of”) other dimensions. We usually use only a single unlimited size dimension that is used for time.
The other 3 dimensions are spatial dimensions with sizes of 3, 4, and 5, respectively.
The recommended maximum number of dimensions is 4. The recommended order of dimensions is time, z, y, x. Not all datasets are required to have all 4 dimensions.
Variables
Create variables on a dataset with the createVariable() method, for example:
lats = f.createVariable('lat', float, ('y', ), zlib=True) lons = f.createVariable('lon', float, ('x', ), zlib=True) orography = f.createVariable('orog', float, ('y', 'x'), zlib=True, least_significant_digit=1, fill_value=0) f.close()
The first argument to createVariable() is the variable name.
The second argument is the variable type. There are many way of specifying type, but Python built-in types work well in the absence of specific requirements.
The third argument is a tuple of previously defined dimension names. As noted above,
The recommended maximum number of dimensions is 4 The recommended order of dimensions is t, z, y, x Not all variables are required to have all 4 dimensions All variables should be created with the zlib=True argument to enable data compression within the netCDF4 file.
When appropriate, the least_significant_digit argument should be used to improve compression and storage efficiency by quantizing the variable data to the specified precision. In the example above the orography data will be quantized such that a precision of 0.1 is retained.
When appropriate, the fill_value argument can be used to specify the value that the variable gets filled with before any data is written to it. Doing so overrides the default netCDF _FillValue (which depends on the type of the variable). If fill_value is set to False, then the variable is not pre-filled.
In the example above the orography data will be initialized to zero, the appropriate value for grid points that are over ocean.
Writing and Retrieving Data
Writing Data
Variable data in netCDF4 datasets are stored in NumPy array or masked array objects.
An appropriately sized and shaped NumPy array can be loaded into a dataset variable by assigning it to a slice that span the variable:
import netCDF4 import numpy as np f = netCDF4.Dataset('orography.nc', 'w') f.createDimension('time', None) f.createDimension('z', 3) f.createDimension('y', 4) f.createDimension('x', 5) lats = f.createVariable('lat', float, ('y', ), zlib=True) lons = f.createVariable('lon', float, ('x', ), zlib=True) orography = f.createVariable('orog', float, ('y', 'x'), zlib=True, least_significant_digit=1, fill_value=0) # create latitude and longitude 1D arrays lat_out = [60, 65, 70, 75] lon_out = [ 30, 60, 90, 120, 150] # Create field values for orography data_out = np.arange(4*5) # 1d array but with dimension x*y data_out.shape = (4,5) # reshape to 2d array orography[:] = data_out lats[:] = lat_out lons[:] = lon_out # close file to write on disk f.close()
We defined a one-dimensional array data_out and reshape it to a 2 dimensional array and then store it in the orography netCDF variable.
Let’s have look to the netCDF file we generated and in particular its metadata:
import netCDF4 f = netCDF4.Dataset('orography.nc', 'r') print(f) f.close()
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
dimensions(sizes): time(0), z(3), y(4), x(5)
variables(dimensions): float64 lat(y), float64 lon(x), float64 orog(y,x)
groups:
Understand netCDF file
Check the orography.nc file:
- What is the first value of the orog variable?
- What is the missing value for lat and lon variables?
Solution
- The first value of orog is 0 but it appears as “_” which corresponds to a missing value (also called filled value). When we defined orog variable, we set its filled/missing value to 0. Therefore, every occurence of 0 will be flagged as a missing value.
- We haven’t specified any filled values for lat and lon so default missing values are set by default, depending on the type of variable; here lat and lon were defined as float so the default missing value is 9.969209968386869e+36. You can check the netCDF4 default filled values:
netCDF4.default_fillvals
Retrieve Data
Let’s first read the file we previously generated:
import netCDF4 import numpy as np f = netCDF4.Dataset('orography.nc', 'r') lats = f.variables['lat'] lons = f.variables['lon'] orography = f.variables['orog'] print(lats[:]) print(lons[:]) print(orography[:]) f.close()
lats vs. lats[:]?
- What is the type of
lats
?- What is the difference between
lats
andlats[:]
?Solution
lats is a netCDF variable: ` type(lats)
returns:
<class ‘netCDF4._netCDF4.Variable’>`lats is a netCDF variable; a lot more than a simple numpy array while lats[:] allows you to access the latitudes values stored in the lats netCDF variable. lats[:] is a numpy array.
Values can be retrieved using most of the usual NumPy indexing and slicing techniques.
However, there are differences between the NumPy and netCDF variable slicing rules; see the netCDF4-python docs for details.
Now if we only want to access a subset of the variable orog:
import netCDF4 import numpy as np f = netCDF4.Dataset('orography.nc', 'r') lats = f.variables['lat'] lons = f.variables['lon'] orography = f.variables['orog'] print(lats[:]) print(lons[:]) print(orography[:][3,2]) f.close()
Types of netCDF files?
There are four NetCDF format variants according to the Unidata NetCDF FAQ page:
- the classic format,
- the 64-bit offset format,
- the NetCDF-4 format, and
- the NetCDF-4 classic model format. While this seems to add even more complexity to using NetCDF files, the reality is that unless you are generating NetCDF files, most applications read NetCDF files regardless of type with no issues. This aspect has been abstracted for the general user!
The classic format has its roots in the original version of the NetCDF standard. It is the default for new files.
The 64-bit offset simply allows for larger dataset to be created. Prior to the offset, files would be limited to 2 GB. A 64-bit machine is not required to read a 64-bit file. This point should not be a concern for many users.
The NetCDF-4 format adds many new features related to compression and multiple unlimited dimensions. NetCDF-4 is essentially a special case of the HDF5 file format. netCDF4 is and extension to the classic model often called netCDF3. netCDF4 adds more powerful forms of data representation and data types at the expense of some additional complexity; it is based on HDF and therefore requires the installation of the HDF libraries prior to the installation of netCDF. It you installed netCDF without having HDF libraries on your machine, then you probably only have netCDF3.
The NetCDF-4 classic model format attempts to bridge gaps between the original NetCDF file and NetCDF-4.
Luckily for us, the NetCDF4 Python module handles many of these differences.
CF conventions?
netCDF format is self-describing but very flexible and you still have to decide how to encode your data into the format:
- Layout of data within the file
- Unambiguous names for fields; Use standard names if possible
- Units
- Fill/missing values
Therefore to simplify developments of tools and speed-up netCDF data processing, metadata standards for netCDF files have been created. The most common in our discipline is the Climate and Forecast metadata standard, also called CF conventions.
The CF conventions have been adopted by the Program for Climate Model Diagnosis and Intercomparison (PCMDI), the Earth System Modeling Framework (ESMF), NCAR, and various EU and international projects.
If you plan to create netCDF files, following CF conventions is recommended. However, if you are curious or encounter data using a different convention, Unidata maintains a list you can use to find out more information.
Let’s have a look at a netCDF file called sresa1b_ncar_ccsm3-example.nc
that follows CF conventions.
If you haven’t downloaded this file, see the setup instructions:
import xarray as xr f = xr.open_dataset("sresa1b_ncar_ccsm3-example.nc") print(f)
<xarray.Dataset>
Dimensions: (bnds: 2, lat: 128, lon: 256, plev: 17, time: 1)
Coordinates:
* lat (lat) float32 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
* lon (lon) float32 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
* plev (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 3e+03 2e+03 1e+03
* time (time) object 2000-05-16 12:00:00
Dimensions without coordinates: bnds
Data variables:
area (lat, lon) float32 ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
msk_rgn (lat, lon) int32 ...
pr (time, lat, lon) float32 ...
tas (time, lat, lon) float32 ...
time_bnds (time, bnds) object ...
ua (time, plev, lat, lon) float32 ...
Attributes: (12/18)
CVS_Id: $Id$
creation_date:
prg_ID: Source file unknown Version unknown Date unknown
cmd_ln: bds -x 256 -y 128 -m 23 -o /data/zender/data/dst_T85.nc
history: Tue Oct 25 15:08:51 2005: ncks -O -x -v va -m sresa1...
table_id: Table A1
... ...
references: Collins, W.D., et al., 2005:\n The Community Climate...
acknowledgment: Any use of CCSM data should acknowledge the contrib...
realization: 1
experiment_id: 720 ppm stabilization experiment (SRESA1B)
comment: This simulation was initiated from year 2000 of \n C...
model_name_english: NCAR CCSM
HDF
Hierarchical Data Format (HDF) is a data file format designed by the National Center for Supercomputing Applications (NCSA).
It is now developed and maintained by the HDF group.
Hierarchical Data Format, commonly abbreviated HDF, HDF4, or HDF5, is a library and multi-object file format for the transfer of graphical and numerical data between computers.
HDF supports several different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organising the data and metadata. New data models can be added by the HDF developers or users.
HDF is self-describing, allowing an application to interpret the structure and contents of a file without any outside information. One HDF file can hold a mixture of related objects, which can be accessed as a group or as individual objects.
Check an HDF file?
The file (the MISR Level 3 Land products May 2007) we will be using has been downloaded at https://l0dup05.larc.nasa.gov/L3Web/download. You first need to register (free registration) to download data from this link and for this tutorial, we have included a copy in the tutorial dataset.
$ file MISR_AM1_CGLS_MAY_2007_F04_0031.hdf
MISR_AM1_CGLS_MAY_2007_F04_0031.hdf: Hierarchical Data Format (version 4) data
Note that in addition it gives some information on the version (4). The HDF version may be important as we will explain later in this lesson.
HDF data files and python?
If you are using python anaconda, HDF files can be accessed in python using the netCDF4 python package, exactly as netCDF files. The reason is that netCDF files (netCDF4) is based on HDF5. pyhdf is also a very well know python package used to access HDF files. However, in this lesson we will only show how to handle HDF files with netCDF4 python package.
Let’s first look at MISR_AM1_CGLS_MAY_2007_F04_0031.hdf and gets metadata:
import xarray as xr f=xr.open_dataset('data/MISR_AM1_CGLS_MAY_2007_F04_0031.hdf') print("Metadata for the dataset:") print(f) print("List of available variables (or key): ") f.variable.keys() print("Metadata for 'NDVI average' variable: ") f.variables["NDVI average"]
Quick visualization with Python
We will see later how to customize our plots and make geographical maps but for now we just want to get a quick overview of our data with python using imshow from the matplotlib package.
%matplotlib inline import xarray as xr import matplotlib.pyplot as plt f=xr.open_dataset('MISR_AM1_CGLS_MAY_2007_F04_0031.hdf') data=f.variables['NDVI average'][:] print(type(data)) print(data.shape) plt.imshow(data) plt.show()
We are using matplotlib, the most common 2D plotting python package. Many other plotting python packages make use of matplotlib and offer to users higher-level user interface. This is why it is always useful to know about matplotlib.
Types of HDF files?
There are two versions of HDF technologies that are completely different: HDF4 and HDF5. HDF4 is the first HDF format. The HDF5 Format is completely different from HDF4 and different libraries are required to read/write these two different formats. Hopefully, the NetCDF4 Python module handles many of these differences. Nowadays, when we refer to HDF files, we often mean HDF5 data files.
EOS Conventions?
HDF-EOS is an extension of HDF4. On December 18, 1999, NASA and its international partners launched Terra, the first of the Earth Observing System (EOS) satellites planned for NASA’s Earth Science Enterprise (ESE) program. Although HDF meets many NASA specifications for accessing data, EOS applications required additional conventions and data types, which led to the development of HDF-EOS based on HDF4.
Tools for handling HDF-EOS can be found at the HDF-EOS Tools and Information Centre.
In 2007, HDF-EOS5 Data Model File Format became an approved standard recommended for use in NASA Earth Science Data Systems. As for HDF-EOS, it is an extension of HDF but it is now based on HDF5.
HDF4 or HDF5?
The examples used in the following exercises are HDF files downloaded at http://hdfeos.org/zoo/index_openGESDISC_Examples.php.
- What is the type of
OMI-Aura_L3-OMTO3e_2017m0105_v003-2017m0203t091906.he5
andAIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf
?Solution
OMI-Aura_L3-OMTO3e_2017m0105_v003-2017m0203t091906.he5
is an HDF5 file and more precisely HDF-EOS5 whileAIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf
is and HDF4 file. You can check either using thefile
command or in python:import xarray as xr f = xr.open_dataset('data/OMI-Aura_L3-OMTO3e_2017m0105_v003-2017m0203t091906.he5') print(f)
<class 'netCDF4._netCDF4.Dataset'> root group (NETCDF4 data model, file format HDF5): dimensions(sizes): variables(dimensions): groups: HDFEOS, HDFEOS INFORMATION
file AIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf
AIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf: Hierarchical Data Format (version 4) data
Quick plot with imshow
The examples used in the following exercises are HDF files downloaded at http://hdfeos.org/zoo/index_openGESDISC_Examples.php.
- Write a python program using imshow to visualize the 2D variable
ColumnAmountO3
fromOMI-Aura_L3-OMTO3e_2017m0105_v003-2017m0203t091906.he5
?- Do the same for the variable called
topog
forAIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf
?Solution
%matplotlib inline from netCDF4 import Dataset import matplotlib.pyplot as plt f = Dataset('data/OMI-Aura_L3-OMTO3e_2017m0105_v003-2017m0203t091906.he5') data = f.groups['HDFEOS'].groups['GRIDS'].groups['OMI Column Amount O3'].groups['Data Fields'].variables['ColumnAmountO3'] plt.imshow(data) plt.show() f.close()
2.
%matplotlib inline import xarray as xr f = xr.open_dataset('AIRS.2002.08.30.227.L2.RetStd_H.v6.0.12.0.G14101125810.hdf') data = f.variables['topog'] plt.imshow(data) plt.show()
GeoTIFF
TIFF (Tagged Image File Format) is a raster file format for digital images, created by Aldus Corporation (merged later with Adobe) widely used for raster imagery and aerial photoraphy.
A GeoTIFF file is a georeferenced TIFF, which means it can have geographic information such as map projections embedded within the header file of the TIFF itself.
This means that a GIS application can position the image in the correct location.
Strictly speaking, GeoTIFF is a metadata format, but the TIFF format enables both the data and metadata to exist in the same file.
GeoTIFFs are commonly used for a range of raster data, including:
- remote sensing & aerial photography;
- satellite data;
- cartographic representations, such as georeferenced topomaps.
GeoTIFFs are also often supported by GPS units.
Let’s take an example; a GeoTIFF MODIS image showing Southern Norway and Sweden (date: 16/08/2017) with True Color:
Source: MODIS Aqua 1km over Southern Norway and Sweden
We cannot use netCDF4 python package to read/write GeoTIFF data files but we can use the Geospatial Data Abstraction Library (GDAL) tools and library to inspect this file. This additional package needs to be installed (see setup instructions). This library is very powerful and can read/write most raster data formats, including netCDF and HDF. It means that the example we will be given for reading GeoTIFF files can also be applied to read the netCDF and HDF files we used in this lesson. See the list of data formats GDAL can handle here.
from osgeo import gdal datafile = gdal.Open('metos_python/data/Southern_Norway_and_Sweden.2017229.terra.1km.tif') print( "Driver: ",datafile.GetDriver().ShortName, datafile.GetDriver().LongName) print( "Size is ", datafile.RasterXSize, datafile.RasterYSize) print( "Bands = ", datafile.RasterCount) print( "Coordinate System is:", datafile.GetProjectionRef ()) print( "GetGeoTransform() = ", datafile.GetGeoTransform ()) print( "GetMetadata() = ", datafile.GetMetadata ())
Driver: GTiff GeoTIFF
Size is 910 796
Bands = 3
Coordinate System is: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433],AUTHORITY["EPSG","4326"]]
GetGeoTransform() = (4.0, 0.017582417582417617, 0.0, 62.0, 0.0, -0.008793969849246248)
GetMetadata() = {'AREA_OR_POINT': 'Area', 'TIFFTAG_SOFTWARE': 'ppm2geotiff v0.0.9'}
On Linux or Mac OSX, you can also use gdalinfo
command line to get similar information:
gdalinfo metos_python/data/Southern_Norway_and_Sweden.2017229.terra.1km.tif
Driver: GTiff/GeoTIFF
Files: Southern_Norway_and_Sweden.2017229.terra.1km.tif
Size is 910, 796
Coordinate System is:
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0],
UNIT["degree",0.0174532925199433],
AUTHORITY["EPSG","4326"]]
Origin = (4.000000000000000,62.000000000000000)
Pixel Size = (0.017582417582418,-0.008793969849246)
Metadata:
AREA_OR_POINT=Area
TIFFTAG_SOFTWARE=ppm2geotiff v0.0.9
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( 4.0000000, 62.0000000) ( 4d 0' 0.00"E, 62d 0' 0.00"N)
Lower Left ( 4.0000000, 55.0000000) ( 4d 0' 0.00"E, 55d 0' 0.00"N)
Upper Right ( 20.0000000, 62.0000000) ( 20d 0' 0.00"E, 62d 0' 0.00"N)
Lower Right ( 20.0000000, 55.0000000) ( 20d 0' 0.00"E, 55d 0' 0.00"N)
Center ( 12.0000000, 58.5000000) ( 12d 0' 0.00"E, 58d30' 0.00"N)
Band 1 Block=910x3 Type=Byte, ColorInterp=Red
Band 2 Block=910x3 Type=Byte, ColorInterp=Green
Band 3 Block=910x3 Type=Byte, ColorInterp=Blue
This GeoTIFF file contains 3 bands (Red, Green and Blue), so we can extract each band and visualize one band with imshow:
bnd1 = datafile.GetRasterBand(1).ReadAsArray() bnd2 = datafile.GetRasterBand(2).ReadAsArray() bnd3 = datafile.GetRasterBand(3).ReadAsArray() plt.imshow(bnd1) plt.show()
To visualize our 3 bands as an RGB image, we need to stack the 3 indivual arrays into one image:
print(type(bnd1), bnd1.shape) print(type(bnd2), bnd3.shape) print(type(bnd3), bnd3.shape) img = np.dstack((bnd1,bnd2,bnd3)) print(type(img), img.shape) plt.imshow(img) plt.show()
vector data formats
Vector data are very common and efficient geospatial formats and are often used to store things like road and plot locations, boundaries of states, countries and lakes.
Vector data are composed of discrete geometric locations (x,y values) known as vertices that define the “shape” of the spatial object. The organization of the vertices determines the type of vector that we are working with: point, line or polygon, etc.
Source: National Ecological Observatory Network (NEON) http://datacarpentry.org/r-spatial-data-management-intro/images/dc-spatial-vector/pnt_line_poly.png
- Points: Each individual point is define by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: sampling locations, the location of individual trees or the location of plots.
- Lines: Lines are composed of many (at least 2) vertices that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has defined x, y location.
- Polygons: A polygon consists of 3 or more vertices that are connected and “closed”. Occasionally, a polygon can have a hole (or more than one) in the middle of it (like a doughnut), this is something to be aware of but not an issue we will deal with in this tutorial series. Objects often represented by polygons include: outlines of plot boundaries, lakes, oceans and states or country boundaries.
Some vector data formats can also include multipolygons (include several polygons) such as GeoJSON data format.
Shapefile
The most ubiquitous vector format is the ESRI shapefile. Geospatial Software company ESRI released the shapefile format specification as an open format in 1998. It was initially developed for their ArcView software but it became quickly an unofficial GIS standard.
Let’s take an example (downloaded from http://www.mapcruzin.com/free-norway-arcgis-maps-shapefiles.htm).
ls Norway_places
A-README.TXT places.dbf places.prj places.shp places.shx
When working with shapefiles, it is important to remember that a shapefile consists of 3 (or more) files:
- .shp: the file that contains the geometry for all features.
- .shx: the file that indexes the geometry.
- .dbf: the file that stores feature attributes in a tabular format.
These files need to have the same name and to be stored in the same directory (folder) to open properly in a GIS, R or Python tool.
Sometimes, a shapefile will have other associated files including:
- .prj: the file that contains information on projection format including the coordinate system and projection information. It is a plain text file describing the projection using well-known text (WKT) format.
- .sbn and .sbx: the files that are a spatial index of the features.
- .shp.xml: the file that is the geospatial metadata in XML format, (e.g. ISO 19115 or XML format).
Examine shapefile with python
from osgeo import ogr shapedata = ogr.Open('Norway_places')
Type shapedata. and press TAB to see what you can do.
shapedata.CopyLayer shapedata.Reference shapedata.__len__
shapedata.CreateLayer shapedata.Release shapedata.__module__
shapedata.DeleteLayer shapedata.ReleaseResultSet shapedata.__new__
shapedata.Dereference shapedata.TestCapability shapedata.__reduce__
shapedata.Destroy shapedata.__class__ shapedata.__reduce_ex__
shapedata.ExecuteSQL shapedata.__del__ shapedata.__repr__
shapedata.GetDriver shapedata.__delattr__ shapedata.__setattr__
shapedata.GetLayer shapedata.__dict__ shapedata.__str__
shapedata.GetLayerByIndex shapedata.__doc__ shapedata.__swig_destroy__
shapedata.GetLayerByName shapedata.__getattr__ shapedata.__swig_getmethods__
shapedata.GetLayerCount shapedata.__getattribute__ shapedata.__swig_setmethods__
shapedata.GetName shapedata.__getitem__ shapedata.__weakref__
shapedata.GetRefCount shapedata.__hash__ shapedata.name
shapedata.GetSummaryRefCount shapedata.__init__ shapedata.this
For instance, to get the number of layers:
shapedata.GetLayerCount()
1
Data Tip:
A shapefile will only contain one type of vector data: points, lines or polygons.
Then get the first layer and all the points from this layer, with feature “NAME” and the coordinates of each location:
layer = shapedata.GetLayer() places_norway = [] for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) name = feature.GetField("NAME") geometry = feature.GetGeometryRef() places_norway.append([i,name,geometry.GetGeometryName(), geometry.Centroid().ExportToWkt()]) print(places_norway[0:10])
[[0, 'Gol', 'POINT', 'POINT (8.9436636 60.7016106)'], [1, 'Halhjem', 'POINT', 'POINT (5.4263602 60.1455207)'], [2, 'Tromsø', 'POINT', 'POINT (18.9517967 69.6669861)'], [3, 'Oslo', 'POINT', 'POINT (10.7391223 59.913263)'], [4, 'Narvik', 'POINT', 'POINT (17.426652 68.4396792)'], [5, 'Bergen', 'POINT', 'POINT (5.3289029 60.3934769)'], [6, 'Hamna', 'POINT', 'POINT (18.9827839 69.7031209)'], [7, 'Stakkevollan', 'POINT', 'POINT (19.0031056 69.6937324)'], [8, 'Storslett', 'POINT', 'POINT (21.0301562 69.7694272)'], [9, 'Kvaløysletta', 'POINT', 'POINT (18.8708572 69.6953085)']]
Source: http://www.mapcruzin.com/free-norway-arcgis-maps-shapefiles.htm
To get the coordinates of each location in a readable manner, we exported it to WKT (Well-Known Text) format which is now part of ISO/IEC 13249-3:2016 standard. Well-known text (WKT) is a text markup language for representing vector geometry objects on a map, spatial reference systems of spatial objects and transformations between spatial reference systems. We won’t discuss more about this format in this lesson and use it only for printing information from our shapefile.
In our previous example, we could call geometry.GetPoint()
directly and get the very same information because our shapefile contains
points only. Using the method Centroid()
with polygons or lines would return a point only representing the centroids of the polygons or lines.
Shapefile with Points
We can then plot these locations on a map; as there were too many locations, we only plot locations of type city
:
%matplotlib inline from osgeo import ogr from mpl_toolkits.basemap import Basemap import matplotlib.pyplot as plt fig = plt.figure(figsize=[12,15]) # a new figure window ax = fig.add_subplot(1, 1, 1) # specify (nrows, ncols, axnum) ax.set_title('Cities in Norway', fontsize=14) map = Basemap(llcrnrlon=-1.0,urcrnrlon=40.,llcrnrlat=55.,urcrnrlat=75., resolution='i', projection='lcc', lat_1=65., lon_0=5.) map.drawmapboundary(fill_color='aqua') map.fillcontinents(color='#ffe2ab',lake_color='aqua') map.drawcoastlines() shapedata = ogr.Open('Norway_places') layer = shapedata.GetLayer() for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) name = feature.GetField("NAME") type = feature.GetField("TYPE") if type == 'city': geometry = feature.GetGeometryRef() lon = geometry.GetPoint()[0] lat = geometry.GetPoint()[1] x,y = map(lon,lat) map.plot(x, y, marker=marker, color='red', markersize=8, markeredgewidth=2) ax.annotate(name, (x, y), color='blue', fontsize=14) plt.show()
We do not explain here in details how we made this plot because this the purpose of the next lesson!
Shapefile with linestring
A shapefile with lines is very similar to a shapefile with points because each linestring is made of a set of points:
%matplotlib inline shapedata = ogr.Open('Norway_railways') nblayer = shapedata.GetLayerCount() print("Number of layers: ", nblayer) layer = shapedata.GetLayer() railways_norway = [] for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) name = feature.GetField("NAME") geometry = feature.GetGeometryRef() railways_norway.append([i,name,geometry.GetGeometryName(), geometry.GetPoints()]) print(railways_norway[0:10])
To visualize it:
%matplotlib inline from mpl_toolkits.basemap import Basemap import matplotlib.pyplot as plt fig = plt.figure(figsize=[12,15]) # a new figure window ax = fig.add_subplot(1, 1, 1) # specify (nrows, ncols, axnum) map = Basemap(llcrnrlon=-1.0,urcrnrlon=40.,llcrnrlat=55.,urcrnrlat=75., resolution='i', projection='lcc', lat_1=65., lon_0=5.) map.drawmapboundary(fill_color='aqua') map.fillcontinents(color='#ffe2ab',lake_color='aqua') map.drawcoastlines() norway_roads= map.readshapefile('Norway_roads/roads', 'roads') plt.show()
Filter by attributes
In the preceding example, we extracted all the locations from `Norway_places` and then made a plot for big cities only. For this selection, we used a simple `if` statement. In this exercise, we show how to better filter by attributes. Use the preceding example to access information from `Norway_roads` but only for `Stolmakergata`. For instance, use SetAttributeFilter and GetNexFeature to pick-up `stolmakergata` only.
Solution
We used
SetAttributeFilter
andGetNexFeature
to selectstolmakergata
:layer = shapedata.GetLayer() layer.SetAttributeFilter("NAME = 'Stolmakergata'") detail = layer.GetNextFeature() geometry = detail.GetGeometryRef() print("Type of geometry: ", geometry.GetGeometryName()) # go through each points of the line. print(geometry.GetPointCount()) for i in range(geometry.GetPointCount()): xy = geometry.GetPoint(i) print(xy)
Shapefile with polygons
The image below shows a plot of a shapefile containing polygons:
The python code to make this plot is given below for information only. Plotting of shapefiles
will be discussed in the next lesson.
%matplotlib inline from mpl_toolkits.basemap import Basemap import matplotlib.pyplot as plt fig = plt.figure(figsize=[12,15]) # a new figure window ax = fig.add_subplot(1, 1, 1) # specify (nrows, ncols, axnum) map = Basemap(llcrnrlon=-1.0,urcrnrlon=40.,llcrnrlat=55.,urcrnrlat=75., resolution='i', projection='lcc', lat_1=65., lon_0=5.) map.drawmapboundary(fill_color='aqua') map.fillcontinents(color='#ffe2ab',lake_color='aqua') map.drawcoastlines() norway_natural= map.readshapefile('Norway_natural/natural', 'natural') plt.show()
Explore Shapefile with polygons
Use the preceding example to access information from `Norway_natural`. You have to dig one level deeper and access the geometry contained within the polygon geometry. The reason is that a polygon can have 'holes'. So the first polygon is what we call the outer rings and the other polygons are 'holes' Then the polygon information is stored as a line (set of points) so you can access each point of the polygon.
Solution
There is not one solution to this question but here is one way to do it:
layer = shapedata.GetLayer() layer.SetAttributeFilter("NAME = 'Grasdalsvatnet'") detail = layer.GetNextFeature() geometry = detail.GetGeometryRef() print("Type of geometry: ", geometry.GetGeometryName()) print("Number of 'rings': ", geometry.GetGeometryCount()) # Get the coordinates of the whole line line = geometry.GetGeometryRef(0) # all points in the line print(line.GetPointCount()) for i in range(line.GetPointCount()): xy = line.GetPoint(i) print(xy)
polygons versus multi-polygons:
- polygons can have ‘holes’ as shown on the figure below:
- but we can also have multi-polygons; and some of them with holes too!
GeoJSON
GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes, based on JavaScript Object Notation. For the full specifications of this standard, refer to RFC 7946.
This section is based on the GeoJSON documentation provided by Tom MacWright.
Handling GeoJSON in python is very similar to handling shapefiles and we can for instance use the same gdal ogr
python package.
However, before using python, let’s look at a simple GeoJSON file.
You can use your favourite editor to inspect the content of a simple GeoJSON file as it is a “text” file. It has some advantages (portable and readable on any machines) but as you will see when working with GeoJSON files, they can become very quickly too big compared to any binary counterpart.
Let’s look at
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [-118.17, 34.03]
},
"properties": {
"name": "East Los Angeles CA",
"country.etc": "CA",
"pop": " 125121",
"capital": "0"
}
}
]
}
GeoJSON supports the following geometry types:
- Point,
- LineString,
- Polygon,
- MultiPoint,
- MultiLineString,
- MultiPolygon.
Geometric objects with additional properties are Feature objects. Sets of features are contained by FeatureCollection objects.
To read the preceding file la_city.geoson
in python, we can use gdal ogr
and it is very similar to what we did when reading
shapefiles:
from osgeo import ogr la = ogr.Open('la_city.geojson') nblayer = la.GetLayerCount() print("Number of layers: ", nblayer) layer = la.GetLayer() cities_us = [] for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) name = feature.GetField("NAME") geometry = feature.GetGeometryRef() cities_us.append([i,name,geometry.GetGeometryName(), geometry.GetPoints()]) print(cities_us)
It can of course become cumbersome with complex GeoJSON file containing MultiPolygons, MultiLineStrings, etc. but the principle remains the same. Let’s take another example:
Source: Grensedata Norge WGS84 Adm enheter geoJSON downloaded from kartverket; registration is free for open data access
from osgeo import ogr shapedata = ogr.Open('no-all-all.geojson') nblayer = shapedata.GetLayerCount() print("Number of layers: ", nblayer) layer = shapedata.GetLayer() county_norway = [] for i in range(layer.GetFeatureCount()): feature = layer.GetFeature(i) name = feature.GetField("NAVN") geometry = feature.GetGeometryRef() county_norway.append([i,name,geometry.GetGeometryName(), geometry.Centroid().GetPoint()]) print(county_norway[0:10])
We will see more about GeoJSON when we discuss interctive visualization and publishing on the web.
Key Points
raster formats: GeoTIFF, netCDF, HDF data formats
vector formats: shapefile, GeoJSON data format