Friday, December 14, 2012

Calling GDAL/OGR commands from Python

key words: python, GDAL/OGR, subprocess.check_call, os.walk, external commands, environment variables

If you plan to call GDAL/OGR commands from Python code you may need to set up environment variables first. If they are already defined at the system level you don't need to do anything, but when they are not defined there you will have to take care of them yourself.

Calling external commands from Python can be accomplished with the subprocess module. It offers several functions that are useful for this task; in this post I won't cover all the differences between them. The main thing I want to show you is how to set up the environment variables required by an external program called through the subprocess module. In this particular case the external program is any of the GDAL/OGR commands.
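Before the full script, here is the mechanism in isolation. This is only a minimal sketch (the GDAL_DATA path below is a made-up example, not a real install location): copy the current environment, add the variables the external program needs, and hand the result to subprocess through the env argument.

```python
import os
import subprocess
import sys

# Start from a copy of the current environment so system-critical
# variables (PATH, SystemRoot on Windows, ...) survive in the child.
env = os.environ.copy()
env["GDAL_DATA"] = r"C:\gdal\bin\gdal-data"  # hypothetical install path

# check_call runs the command and raises CalledProcessError on a
# non-zero exit code; here the child just proves it sees GDAL_DATA.
subprocess.check_call(
    [sys.executable, "-c", "import os; assert 'GDAL_DATA' in os.environ"],
    env=env,
)
```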

If you are writing a program that calls GDAL/OGR, I can assume you need it to process multiple files, which is why this sample program uses the os.walk function. os.walk lets you process multiple files across multiple folders. This sample filters all TIFF files found under the grid_data folder.
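In isolation, the walk-and-filter pattern can be sketched like this (the folder path in the comment is just a placeholder for your own data directory):

```python
import fnmatch
import os

def find_files(top, pattern):
    """Walk the folder tree under `top` and collect files matching `pattern`."""
    matches = []
    for root, dirs, files in os.walk(top):
        # fnmatch.filter keeps only the names matching the glob pattern
        for filename in fnmatch.filter(files, pattern):
            matches.append(os.path.join(root, filename))
    return matches

# e.g. find_files(r'L:\sp\python\grid_data', '*.tif')
```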

Using the subprocess.check_call function I have demonstrated the invocation of the gdalinfo command for all TIFF files and, just as an example, one OGR command (ogrinfo). The same principle applies to file processing with any of the available GDAL/OGR commands.

Here is the source code.

.... enjoy ...

'''
Created on 10. okt. 2012
@author: Stipica Pavicic
'''
import os, fnmatch, subprocess

for (root, dirs, files) in os.walk(r'L:\sp\python\grid_data'):
    #print (root)    
    #print (os.listdir(root)) # list of files and folders
    #print (fnmatch.filter(os.listdir(root),".")) # filter the list of files and folders
    
    for filename in fnmatch.filter(files,"*.tif"):
      
        fbasename, extension = os.path.splitext(filename)
        
        gdal_root = r'C:\release-1310-gdal-1-9-mapserver-6-0'
        pathv=''
        pathv= pathv + gdal_root  + r'\bin;' 
        pathv= pathv + gdal_root  + r'\bin\gdal\python\osgeo;' 
        pathv= pathv + gdal_root  + r'\bin\proj\apps;' 
        pathv= pathv + gdal_root  + r'\bin\gdal\apps;' 
        pathv= pathv + gdal_root  + r'\bin\ms\apps;' 
        pathv= pathv + gdal_root  + r'\bin\gdal\csharp;'  
        pathv= pathv + gdal_root  + r'\bin\ms\csharp;' 
        pathv= pathv + gdal_root  + r'\bin\curl'
        
        
        #if you don't work with Oracle or don't need the Oracle plugin, rename/remove the \bin\gdal\plugins\gdal_GEOR.dll & ogr_OCI.dll files to avoid unnecessary error messages
        #note: the keys must be plain variable names ("PYTHONPATH", not "SET PYTHONPATH");
        #starting from a copy of os.environ keeps system variables (e.g. SystemRoot) available to the child process
        env_cust = dict(os.environ)
        env_cust.update({"GDAL_DATA":        gdal_root + r'\bin\gdal-data',
                         "GDAL_DRIVER_PATH": gdal_root + r'\bin\gdal\plugins',
                         "PYTHONPATH":       gdal_root + r'\bin\gdal\python\osgeo;' +
                                             gdal_root + r'\bin\ms\python',
                         "PROJ_LIB":         gdal_root + r'\bin\proj\SHARE',
                         "PATH":             pathv})
        
        os.chdir(root)
        #print (root, filename, fbasename, extension)
        #subprocess.check_call(['echo' , filename], env=env_cust, shell=True)
        subprocess.check_call(['gdalinfo', filename], env=env_cust, shell=True)
        subprocess.check_call(['ogrinfo', '--help-general'], env=env_cust, shell=True)

Monday, December 3, 2012

Automation of Raster/grid data processing using ArcGIS and python

key words: automation, processing, python, ArcGIS, arcpy, xml, minidom

The main idea of this post is to define a way to process datasets that are stored as files and organized in multiple folders. To be more specific, I will describe one particular case.

Task: Create GeoTIFF files from delivered ASCII files (x,y,z). The ASCII files are organized in folders: the root folder is “grid_data” and there are several subfolders grouping the data according to resolution (5 m grid, 2 m grid). There is no coordinate system definition attached to the files. The coordinate system definition is delivered in a separate PDF document, as is the information about grid resolution.

Before processing you always have to make some preparations, especially if you want your processing to be autonomous and not dependent on an operator. In this case I decided to create an XML file (info.xml) that holds some basic information about the delivered data as well as the desired parameters for processing. Here is a sample file.
<?xml version="1.0" encoding="UTF-8"?>
<geoprocessing> 
 <input_data>
  <!-- dataset resolution in m --> 
  <res>5</res>
  <!-- name of ESRI prj file -->
  <srs>C:\sp\python\grid_data\srs\ED_1950_UTM_Zone_31N.prj</srs>
 </input_data>
 <results>
  <!-- Root folder for processed files. Leave empty if you want to use same folder as input data -->
  <root_folder></root_folder>
  <!-- name of ESRI prj file. Leave empty if you want to use same srs as input data -->
  <srs></srs>
 </results>
</geoprocessing>
I decided to use XML for this purpose because it is more flexible when you need to make changes (if you want to apply a similar approach to some other task). To parse this XML file I created the following Python class:
'''
Created on 23. nov. 2012

@author: Stipica Pavicic
'''
from xml.dom.minidom import parseString


'''
configuration sample file m:\python\info.xml
'''       

class ConfigObj(object):
    '''
    Parsing of the xml file that holds configuration parameters for processing.
    Parameters are separated into two sections: <input_data> and <results>.
    '''
    _dom=''
    _result_obj=''
    _input_obj=''
    
    class InputObj(object):
        '''
        parsing of <input_data> segment of info.xml file
        '''
        _res = 'not_set'
        _srs = 'not_set'
    
        def __init__(self, dom_obj):
            
            node = dom_obj.getElementsByTagName('input_data')[0]
            
            dom_elem = node.getElementsByTagName('srs')[0].firstChild
            if dom_elem:
                self._srs = dom_elem.data

    
            dom_elem = node.getElementsByTagName('res')[0].firstChild
            if dom_elem:
                self._res = dom_elem.data
         
        
        def getSrs(self):
            return self._srs
        
        def getRes(self):
            return self._res 
    
    class ResultObj(object):
        '''
        parsing of <results> segment of info.xml file
        '''
        _root_folder='not_set'
        _srs='not_set'
    
        def __init__(self, dom_obj):
            
            node = dom_obj.getElementsByTagName('results')[0]
            
            dom_elem = node.getElementsByTagName('srs')[0].firstChild
            if dom_elem:
                self._srs = dom_elem.data
            else:
                # if the result srs is not defined in the xml file, use the input data srs
                nodei = dom_obj.getElementsByTagName('input_data')[0]
                dom_elemi = nodei.getElementsByTagName('srs')[0].firstChild
                self._srs = dom_elemi.data                
    
            dom_elem = node.getElementsByTagName('root_folder')[0].firstChild
            if dom_elem:
                self._root_folder = dom_elem.data
   
        
        def getSrs(self):
            return self._srs
        
        def getRootFolder(self):
            return self._root_folder 
        

    def __init__(self, info_xml_path):

        try:
            # read the whole xml file into a string
            with open(info_xml_path, 'r') as myfile:
                mydata = myfile.read()
            #parse the xml you got from the file
            self._dom = parseString(mydata)
            self._result_obj = self.ResultObj(self._dom)
            self._input_obj = self.InputObj(self._dom)
        except IOError:
            print("IO Error")
        except Exception:
            print("General Error!")
        
        
    def getDom(self):
        return self._dom
    
    def getGetResultParam(self):
        '''
        access <results> parameters
        '''
        return self._result_obj
    
    def getGetInputParam(self):
        '''
        access <input_data> parameters
        '''
        return self._input_obj
To use this class you need to create an XML file with configuration parameters (e.g. c:\test\info.xml). Here is sample code showing how to access the parameters:
from my.xml.configurator import ConfigObj

conf = ConfigObj(r'c:\test\info.xml')
print conf.getGetInputParam().getSrs()
print conf.getGetInputParam().getRes()
print conf.getGetResultParam().getSrs()
print conf.getGetResultParam().getRootFolder()
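As an aside, if you prefer the standard-library xml.etree.ElementTree API over minidom, the same parameters can be read with a sketch like the one below. It follows the info.xml sample above and is not a drop-in replacement for ConfigObj:

```python
import xml.etree.ElementTree as ET

def read_config(xml_text):
    """Read the info.xml parameters; empty result values fall back to input values."""
    root = ET.fromstring(xml_text)
    res = root.findtext('input_data/res')
    in_srs = root.findtext('input_data/srs')
    # findtext returns '' for an empty element, so `or` applies the fallback
    out_srs = root.findtext('results/srs') or in_srs
    out_folder = root.findtext('results/root_folder') or None
    return {'res': res, 'input_srs': in_srs,
            'result_srs': out_srs, 'root_folder': out_folder}
```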
As you can see, it is very easy to adjust the Python code to a differently structured XML file. Maybe you need to add additional parameters inside the <input_data> section, or maybe you want to add a section for some other group of parameters (something like <global>). All this depends on the problem you want to solve with this approach. For this particular case I'll stick with the parameters that already exist in the presented XML file. I'm calling it info.xml, but it can have any other name :-) So, for now you know how to create a configuration file, define some parameters and values, and access those parameters from Python code (class ConfigObj). The next step is to create a script that processes all files in all subfolders starting from some arbitrary root folder. The script will search for configuration files (info.xml) and only the folders that contain one will be processed. Here is the code for that. The processing is straightforward, but I will make a few comments to explain the idea; some comments are also included in the source code.
'''
Created on 23. nov. 2012
@author: Stipica Pavicic
'''

from my.xml.configurator import ConfigObj
from my.process_timer import Timer

import os, fnmatch
import arcpy
import random
import shutil
from arcpy.conversion import PointToRaster

arcpy.CheckOutExtension("3D")

def createTmpGDB(folder):
    '''creates temporary file geodatabase'''
    #temp geodatabase name
    tmp_gdb_name = 'tempXXX.gdb'
    
    #redefine name if already exists
    while os.path.exists(folder + os.sep + tmp_gdb_name):
        tmp_gdb_name = tmp_gdb_name.replace('XXX',str(random.randint(1000000, 9000000)))
    
    arcpy.CreateFileGDB_management(folder, tmp_gdb_name )
        
    return tmp_gdb_name

def deleteTmpGDB(tmp_gdb):
    '''removes temporary file geodatabase'''    
    shutil.rmtree(tmp_gdb)
    return 0 
    

def proces_grid_file(in_file_folder, out_file_folder, resolution, in_srs, out_srs, filename, temporery_gdb):
    try:
        fbasename, extension = os.path.splitext(filename)
        in_csv = in_file_folder + os.sep + filename
        #point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + fbasename 
        #the basename can contain characters that are illegal for arcpy, which is why I use a random name (for this particular case the name of the intermediate result is not important)
        point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
        while os.path.exists(point_data):
            #redefine point_data file name if already exists
            point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
            
        out_tif = out_file_folder + os.sep + fbasename + ".tif"

        arcpy.ASCII3DToFeatureClass_3d(in_csv, "XYZ", point_data, "POINT", "", out_srs, "", "", "DECIMAL_POINT")
        arcpy.PointToRaster_conversion(point_data , "Shape.Z", out_tif, "MOST_FREQUENT", "NONE", str(resolution))
        #process finished without exception
        return 1
    except Exception as e:
        print e
        #process finished with exception
        return 0
    

for (root, dirs, files) in os.walk(r'C:\sp\python\grid_data'):

    measure = Timer()   
    config_file_name='info.xml'
    filter='*.csv'
    
    # get parameters from xml file (conf) and define configuration for processing the files in folder (root)
    if config_file_name in files:
        print '*****************************************************************************'
        print '>>>> Root    : ', root
        print 'Nr. of files : ' , len(fnmatch.filter(files,filter))  # number of files in folder that match the filter expression 
        print 'Config file  : ' , root + os.sep + config_file_name
        conf = ConfigObj(root + os.sep + config_file_name)
        
        #open prj files
        inputSrsFile   = open(conf.getGetInputParam().getSrs(),'r')
        resultSrsFile  = open(conf.getGetResultParam().getSrs(),'r')
        
        #read files
        inputSrs  = inputSrsFile.read()
        resultSrs = resultSrsFile.read()  
        
        #close files 
        inputSrsFile.close()
        resultSrsFile.close()
        
        print 'spatial reference system definition'
        print 'input data srs :' , inputSrs
        print 'results    srs :'  , resultSrs   
        
        
        # create temporary file geodatabase
        tmp_gdb = createTmpGDB(root)
        
        
        # loop through the list of files that match the filter criteria (filter) & process them
        cnt = 0
        for filename in fnmatch.filter(files,filter):
            print '-----------------------------------------------------------------------------'
            cnt = cnt + 1    
            print cnt, ' -> processing file: ', filename
            fbasename, extension = os.path.splitext(filename)        
            #print root, '||', filename, '||', fbasename, '||', extension
            r=proces_grid_file(root, root, conf.getGetInputParam().getRes(), inputSrs, resultSrs, filename, tmp_gdb)
            if r:
                # r=1: processing finished without exception
                print filename, 'processing finished!'
            else:
                # r=0: processing finished with exception
                print filename, ': file is not processed!'
            
            print filename, measure.getTimeDifference()
        
        #delete temporary file geodatabase
        deleteTmpGDB(root + os.sep + tmp_gdb)
        
        # total time for processing
        print measure.getElapsedlTime()
The main loop starts with the os.walk call, but before that I have included some helper functions:
  • createTmpGDB, to create a temporary file geodatabase
  • deleteTmpGDB, to remove the temporary file geodatabase
  • proces_grid_file, the file processing function


... how to use this code?

Firstly, you need to look at how the files are organised. If they are grouped into folders according to some shared property, there is a good chance this code can be useful to you, whether or not you need to perform the same processing as in this post. In this particular case the files were stored in folders according to resolution and spatial reference system, i.e. files with the same resolution and the same spatial reference system should be in the same folder. You can split files with the same properties across multiple folders, but you cannot put files with different properties in the same folder.

Secondly, for every folder that you want to process, create one XML file with the properties. Use the same name for all the XML files (see the config_file_name variable in the code; info.xml). Folders without a configuration file won't be processed.

Thirdly, change the filter to match your case (the filter variable in the code). The files I had to process were delivered as CSV files.

And finally, run the script. To be able to run it you will have to establish a working environment (all imports must be accessible to the script). To measure processing time I also used the Timer class described in one of my previous posts. If you don't want to use it, just comment out everything related to the Timer class: the import, the measure = Timer() assignment, and the two print statements that use measure.
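If you don't have that post at hand, here is a minimal stand-in for the Timer class exposing just the two methods this script calls. This is my assumption about the interface, not the original class (the getElapsedlTime spelling is kept exactly as the script uses it):

```python
import time

class Timer(object):
    '''Minimal stand-in: tracks time since creation and since the last check.'''
    def __init__(self):
        self._start = time.time()
        self._last = self._start

    def getTimeDifference(self):
        # seconds elapsed since the previous call (or since creation)
        now = time.time()
        diff = now - self._last
        self._last = now
        return diff

    def getElapsedlTime(self):
        # total seconds since the timer was created (spelling kept as in the script)
        return time.time() - self._start
```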

Feel free to use this code and to modify it according to your needs. I hope that I will make a new post to show you how to modify existing code and how to use it by following this concept: walking through the folders and processing files based on the xml configuration file.

.....

... enjoy!