Introduction

Jython is the Java implementation of the Python programming language. It is a powerful tool for parsing data, accessing databases and other everyday development tasks. One of the batteries included is zxJDBC, an implementation of the Python DB API. In this article we will parse some hypothetical triathlon results, insert them into MySQL and perform a couple of data mining operations.

Let’s break the task into two parts. First, figure out the general layout of the data file and determine the best approach for parsing the values. Second, establish a connection to MySQL through zxJDBC and learn how to insert the data.

It is assumed the reader has a properly configured MySQL instance running.

Parsing

Below is a sample of the data:

NAME             BIKE  SWIM RUN  TOTAL
---------------- ----- ---- ---- -----
ARAGORN          12.23 0.84 2.75 15.82
GANDALF          11.67 1.32 2.73 15.72
MERRY            9.96  0.88 1.77 12.61
ARWEN            9.28  1.12 1.72 12.12
GALADRIEL        7.97  1.24 1.88 11.09
SAM              7.2   0.80 2.13 10.13
LEGOLAS          7.89  0.44 1.21  9.54

Looking at the data, it’s pretty clear the file uses fixed-width columns and should be easy to parse, but we need to decide on an approach. A table of column indices paired with a function to convert each value should be more than sufficient. Other solutions are certainly possible, but this approach is probably the simplest.

We need a function that can turn arbitrary data into a string and strip the extraneous whitespace that fixed-width columns exhibit:

def string(s):
  return str(s).strip()

The second thing to do is build the table of column indices and the functions that handle each column's data:

# our parsing rules (start, end, data converting function)
columns = {
  'name'  :(0, 16, string),
  'bike'  :(17, 22, float),
  'swim'  :(23, 27, float),
  'run'   :(28, 32, float),
  'total' :(33, 38, float)
}

The structure we have chosen is pretty straightforward. It is a Python dictionary with the column name as the key and a tuple as the value. The tuple contains the starting index, the ending index and the function used to convert the data string into the type we’re interested in storing in the database. If you’re familiar with Python you’ll note the use of the builtin function ‘float’ as one handler. The only other handler we need is our whitespace-stripping string function.
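To see the rules in action before writing the full parser, here is a quick, illustrative check of one rule against the first data line from the sample above (the spacing of the literal matters, since we are slicing by position):

# apply the 'bike' rule to a sample line; slicing by position
#  and converting with the handler yields the float 12.23
line = "ARAGORN          12.23 0.84 2.75 15.82"

start, end, func = columns['bike']
print func(line[start:end])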

Now that we have the parsing rules built, we can focus on the parsing itself. One of the most straightforward approaches is to read all the lines from the file, apply the rules to each one, and build a dictionary of column/value pairs from the result of calling the handler on each slice. We can accomplish this with the function below:

def parse(fn):
  '''
  Parse the file and return a list of dictionaries containing
  the column/value pairs from the row.
  '''

  # open the file
  fp = open(fn, "r")

  # first read all the lines, then strip any whitespace
  #  finally, filter any rows that have no length
  data = filter(lambda x: x, map(lambda x: x.strip(), fp.readlines()))

  # we don't need the header columns, we already have that
  #  information captured in the columns dictionary
  data = data[2:]

  # it's important to close what we open
  fp.close()

  # a list of all the processed rows
  results = []

  # iterate the data
  while data:
    # build our datastore
    ds = {}

    # get the row to be processed
    row = data[0]

    # iterate the columns
    for name, rules in columns.items():

      # unfold our rules
      start, end, func = rules

      # store the results of applying our rule
      ds[name] = func(row[start:end])

    # add the processed row to the rest
    results.append(ds)

    # since we just worked on this row we can get rid of it
    del data[0]

  # return the results of parsing the file
  return results

The parse function works rather methodically through all the rows of the file and returns the parsed results. For each row it creates a new instance of a dictionary to store the value for each column. It then proceeds to slice the line into the fixed widths and apply the function from our simple rules table. The result of this evaluation is stored in the dictionary by column name. The function then returns the list of dictionaries representing database rows.
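For instance, run against the sample data above, parse would return a list of seven dictionaries; the entry for the first row would look something like this (the key order in the display will vary):

>>> rows = parse("pyzine.txt")
>>> len(rows)
7
>>> rows[0]
{'name': 'ARAGORN', 'total': 15.82, 'bike': 12.23, 'swim': 0.84, 'run': 2.75}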

Databases

Now that we have the ability to generate the parsed data, let’s look at how to create a database connection. The DB API details two primary classes: connections and cursors. We will first look at how to open a connection and then move on to cursors. There is no single method for establishing a connection through the API; there is a general convention, but not all databases expect the same information. Since we’re focusing on zxJDBC, we’ll look at how to establish a connection through its API.

zxJDBC is a wrapper around JDBC and therefore needs a valid java.sql.Driver in order to operate. We will be using the open-source MySQL driver com.mysql.jdbc.Driver to connect to MySQL. In order to establish our connection we need the URL of the database, a username, a password and the fully qualified class name of the driver. A typical function for opening a connection is below:

def get_connection():
  """Open and return a database connection."""

  # the parameters for establishing a connection
  url = "jdbc:mysql://localhost/racing"
  username = "bzimmer"
  password = None
  jdbcdriver = "com.mysql.jdbc.Driver"

  # establish the connection and return it
  from com.ziclix.python.sql import zxJDBC
  db = zxJDBC.connect(url, username, password, jdbcdriver)

  return db

The URL specifies that it is a JDBC URL and tells the driver the hostname and database name, in this case localhost and racing respectively. There is a username, bzimmer, but the database is not password protected. The final piece of information is the fully qualified driver name. After assigning local variables for all of this we open a connection through the zxJDBC factory method connect, and an open connection to the MySQL database is returned.
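The same URL scheme covers remote servers as well; if the database lived on another machine or listened on a non-default port, the URL would look something like the hypothetical example below, with the rest of get_connection unchanged:

# a hypothetical remote server on an explicit port; everything else
#  in get_connection stays exactly the same
url = "jdbc:mysql://db.example.com:3306/racing"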

Once we’ve established the connection we can begin inserting the data we parsed earlier. The procedure for inserting the data into the database is, like the parsing, straightforward. Since we have all the data we need, we’ll iterate the list, generate the SQL and execute it.

def store(data):
  """Store the data in the database."""

  # get a new connection
  db = get_connection()

  # open a cursor so we can perform our work
  cursor = db.cursor()

  try:
    try:
      # since we have all new results, drop the existing table
      cursor.execute("drop table triathlon")
    except:
      # it might not be there
      pass

    # create the table; ddl is a module-level string holding
    #  the create table statement (see the sketch below)
    cursor.execute(ddl)

    for row in data:
      # build our sql dynamically
      items = row.items()

      keys = map(lambda x: x[0], items)
      values = map(lambda x: x[1], items)

      # this sql is for a prepared statement
      sql = "insert into triathlon (%s) values (%s)" \
        % (",".join(keys), ",".join("?" * len(values)))

      # execute the sql
      cursor.execute(sql, values)
  finally:
    # close what we open
    cursor.close()
    db.close()
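One detail worth noting: store refers to a module-level ddl string that isn’t shown above; it simply holds the create table statement. A plausible definition, matching the columns we parse, might look like this (the exact column types here are my assumption, not taken from the original script):

# a plausible definition of the module-level ddl string used by store();
#  the column types are an assumption based on the parsed data
ddl = """
  create table triathlon (
    name  varchar(16),
    bike  float,
    swim  float,
    run   float,
    total float
  )
"""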

The cursor object of the Python DB API was mentioned earlier, and it is probably obvious from the store function that it is the workhorse of the API. To obtain a cursor, invoke the appropriately named cursor method on an open connection instance. Any number of cursors can be acquired from an open connection, and a cursor is reusable for any number of invocations until it, or the underlying connection, is closed.

After getting a cursor we drop the existing table, if it exists, to minimize any data problems. We wrap the statement in a try/except block since we want to continue even if it fails, which it certainly will the first time we run the function. Once the table is created from the DDL we are ready to start populating our new table.
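As an aside, MySQL also understands drop table if exists, so the bare except could be avoided entirely, at the cost of being MySQL-specific:

# a MySQL-specific alternative to wrapping the drop in a try/except
cursor.execute("drop table if exists triathlon")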

Recall from earlier that data is a list of dictionaries, each dictionary containing the values for all the columns for one row in the database table triathlon. We set up a for loop to handle iterating the list. The first line of the for loop gets the column/data tuples for the row. The next two lines then pull the keys and values apart into separate lists for use in generating the SQL statement.

SQL statements come in one of two forms, parameterized or not. If parameterized, a ? is inserted in place of the string representation of each data value. This is useful because escaping SQL strings can be quite difficult, and parameterizing a statement sometimes allows for optimization by both the driver and the database. Dissecting the line, we find the string constant:

insert into triathlon (%s) values (%s)

requires two substitutions. The first substitution is the list of column names to be inserted, separated by commas. The second substitution is a list of ? placeholders, one for each value to be inserted (the number of values must equal the number of columns). The latter is accomplished through:

",".join("?" * len(values))

which would produce ?,?,? for an insertion of three values.
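Putting the two substitutions together for our table, the generated statement would read something like the line below (the column order depends on the dictionary’s iteration order, so it may vary from run to run):

insert into triathlon (name,bike,swim,run,total) values (?,?,?,?,?)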

We finally arrive at the point where we can insert the data. We invoke the execute() method on the cursor with the SQL statement we just created and the values we parsed out earlier. Had we created a fully qualified SQL string rather than a prepared statement, we would not need to pass the values to execute() since the data would be embedded in the SQL string itself.
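For contrast, inside the same loop a non-parameterized version of the insert might be built something like this sketch, where we suddenly become responsible for quoting the name ourselves:

# a sketch of the non-parameterized alternative; note we must quote
#  (and, in general, escape) the string value ourselves
sql = "insert into triathlon (name, bike, swim, run, total)" \
  " values ('%s', %s, %s, %s, %s)" \
  % (row['name'], row['bike'], row['swim'], row['run'], row['total'])
cursor.execute(sql)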

After handling all the data a finally block closes the cursor and connection. We’re now done inserting the data but we still have to answer our question: which athlete is strongest on the bike relative to the other events?

Performing queries

To answer the question we need to perform a query. This is very similar to inserting data, the primary difference being that we’re interested in obtaining a result set from the database. The function below demonstrates how to perform the query and fetch the results. Once execute() has been invoked, the cursor stores the results of the query internally (depending on the implementation of the driver and/or cursor, the rows might remain on the server or the entire result set might be copied to the client), and they are available through the fetchXXX methods.

The three fetchXXX methods are fetchone, fetchmany and fetchall. The methods fetchmany and fetchall return a list of tuples, whereas fetchone returns a single tuple. The method fetchmany has an optional argument to specify how many rows to return in the list; fetchall returns all remaining rows. All three methods return None to indicate no more rows are available in the result set. (A short fetchmany sketch follows the function below.)

def strongest_bike():
  # get a new connection
  db = get_connection()

  # open a cursor so we can answer our question
  cursor = db.cursor()
  try:
    # find the answer
    cursor.execute("select name, bike/total from triathlon order by 2 desc")
    for row in cursor.fetchall():
      print row

  finally:
    # close what we open
    cursor.close()
    db.close()
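For completeness, here is a short sketch of processing the same results in batches with fetchmany, using a cursor obtained as above; the batch size of three is arbitrary:

# a sketch of batched fetching with fetchmany; the batch size is arbitrary
cursor.execute("select name, bike/total from triathlon order by 2 desc")
while 1:
  rows = cursor.fetchmany(3)
  if not rows:
    break
  for row in rows:
    print row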

It is also possible to run queries in the interactive interpreter:

>>> db = pyzine.get_connection()
>>> c = db.cursor()
>>> c.execute("select name, run/total from triathlon order by 2 desc")
>>> while 1:
...  row = c.fetchone()
...  if row is None: break
...  print row
...
('SAM', 0.2102999985218048)
('ARAGORN', 0.1738000065088272)
('GANDALF', 0.1737000048160553)
('GALADRIEL', 0.16949999332427979)
('ARWEN', 0.14190000295639038)
('MERRY', 0.1404000073671341)
('LEGOLAS', 0.12680000066757202)
>>> c.close()
>>> db.close()

Since we inserted the data into a database we can use the database’s tools to run queries:

mysql> select name, bike/total from triathlon order by 2 desc;
+-----------+------------+
| name      | bike/total |
+-----------+------------+
| LEGOLAS   |     0.8270 |
| MERRY     |     0.7898 |
| ARAGORN   |     0.7731 |
| ARWEN     |     0.7657 |
| GANDALF   |     0.7424 |
| GALADRIEL |     0.7187 |
| SAM       |     0.7108 |
+-----------+------------+
7 rows in set (0.00 sec)

Putting it all together

A small function executes all of our work:

def go(fn):
  # parse
  r = parse(fn)

  # store
  store(r)

  # query
  strongest_bike()

Running it from the Jython interpreter:

$ jython
Jython 2.1 on java1.3.0 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> import pyzine
>>> pyzine.go("pyzine.txt")
('LEGOLAS', 0.8270000219345093)
('MERRY', 0.7897999882698059)
('ARAGORN', 0.7731000185012817)
('ARWEN', 0.7656999826431274)
('GANDALF', 0.7423999905586243)
('GALADRIEL', 0.7186999917030334)
('SAM', 0.7107999920845032)

So it looks like all of our work paid off: we are now able to answer our question, and the strongest relative cyclist is Legolas.

Conclusion

A number of topics were skipped in building these scripts, in particular transactions, DataHandlers, more complex types like BLOBs and CLOBs, zxJDBC extensions, stored procedures, threads, dbexts and isql.

The full source for the examples in this article can be found here.

Footnote

This article originally ran in pyzine but appears to no longer be online.
