Marty Fuhry [A Python Summer of Code]: June 2009

Tuesday, June 30, 2009

From Here On Out, It's Math

The code is coming along nicely.

>>> p.parse_date("01/01/1980", "Y")
(10L, 9L)

Real simple. First, we import the parsedates module, then we set the callback function to the mxDateTime Parser. The parse_date function takes two Strings: a date, which it passes to the mxDateTime Parsing function, and a frequency, which it converts into an int (to be stored internally). The Parsing function for the date returns a Python DateTime object (which, for my use, is basically a tuple filled with year, month, day, hour, minute, second, etc.). I can extract these fields and pass them all into a master function to calculate the date. The frequency is taken from the second String (and proper error messages are raised in the event of a bizarre frequency) and stored internally (as an int for now).

A few things will need to be changed, though...

The frequency will need to be parsed too, since Travis needs to support "custom" frequencies (read more about them here). Perhaps this will call for a second Python callback function, as parsing Strings with regular expressions is relatively easy in Python and difficult at best in C.
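A minimal sketch of what such a second callback might look like in Python. The spelling of a custom frequency (an optional integer multiplier followed by a unit code, such as "5D"), the unit table, and the name parse_freq are all assumptions made for illustration; the real format may differ.

```python
import re

# Hypothetical unit codes; the integers are placeholders, not the real internal values.
UNIT_CODES = {"Y": 0, "M": 1, "W": 2, "D": 3, "h": 4, "m": 5, "s": 6}

# Assumed spelling of a frequency: optional multiplier, then a unit code.
_FREQ_RE = re.compile(r"^\s*(\d+)?\s*([A-Za-z]+)\s*$")


def parse_freq(text):
    """Return (multiplier, unit_code) for strings such as 'Y' or '5D'."""
    match = _FREQ_RE.match(text)
    if match is None:
        raise ValueError("could not parse frequency %r" % (text,))
    multiplier = int(match.group(1) or 1)
    unit = match.group(2)
    if unit not in UNIT_CODES:
        raise ValueError("unknown frequency unit %r" % (unit,))
    if multiplier < 1:
        raise ValueError("frequency multiplier must be positive")
    return multiplier, UNIT_CODES[unit]
```

The C side would then only need to call this function and unpack two ints from the returned tuple, instead of doing any string handling itself.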

The mxDateTime parser returns a Python datetime object, which only supports time units down to the microsecond. The NumPy DateTime module must support units as fine as femtoseconds. Hopefully this will be doable with just a couple of lines of code added to the parser.
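To make the resolution gap concrete, here is a small sketch. It checks the finest step the standard datetime type can represent, and shows one way the fractional-seconds digits could be kept as an integer count of femtoseconds instead. The helper name fraction_to_femtoseconds is hypothetical and is not part of the parser.

```python
import datetime
import re

FEMTOS_PER_SECOND = 10 ** 15


def fraction_to_femtoseconds(digits):
    """Convert the digits after the decimal point of a seconds field to femtoseconds."""
    if re.fullmatch(r"[0-9]{1,15}", digits) is None:
        raise ValueError("expected 1 to 15 decimal digits, got %r" % (digits,))
    # Right-pad to 15 places so "5" means 0.5 s and "000001" means 1 microsecond.
    return int(digits.ljust(15, "0"))


# The standard datetime type cannot hold anything finer than this step.
finest_step = datetime.datetime.resolution

# A signed 64-bit count of femtoseconds spans only a few hours either side of the epoch.
span_hours = (2 ** 63) / float(FEMTOS_PER_SECOND) / 3600.0
```

The last line is a reminder that a 64-bit value at femtosecond resolution covers a very short span of time, so the frequency and the representable range trade off against each other.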

Thursday, June 25, 2009

Parsed!

The callback worked! The code previously posted only had to be slightly modified.

parsedates.set_callback(timeseries_parse.DateTimeFromString)

I have the mxDateTime Parser modified from the Scikits Timeseries imported here as timeseries_parse. In that program is a magical date parsing function called DateTimeFromString (and a similar DateFromString) which takes a string and returns a Python datetime object filled with the correct date amounts.

>>> parsedates.parse_date("01/01/2001")
datetime.datetime(2001, 1, 1, 0, 0)

So here we see a datetime object with (Year, Month, Day, Hour, Minute, Second). Turning that into a long number is very easy, and it all depends on the frequency metadata. If our frequency is in years, then our number is (Year - 1970).
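A minimal sketch of that conversion in Python. Only the yearly rule (Year - 1970) comes from the description above; the month, day, and second rules, the single-letter frequency codes, and the name datetime_to_long are assumptions that extend the same idea.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def datetime_to_long(dt, freq):
    """Count whole `freq` units between the Unix epoch and `dt`."""
    if freq == "Y":
        return dt.year - EPOCH.year
    if freq == "M":
        return (dt.year - EPOCH.year) * 12 + (dt.month - 1)
    delta = dt - EPOCH
    if freq == "D":
        return delta.days
    if freq == "s":
        return delta.days * 86400 + delta.seconds
    raise ValueError("unsupported frequency %r" % (freq,))
```

For example, datetime_to_long(datetime.datetime(2001, 1, 1), "Y") gives 31, and dates before 1970 come out negative.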

Tuesday, June 23, 2009

Callbacks

Sometimes you need to run a Python code segment from C in your module. There are a lot of good reasons to do this. C doesn't have much support for regular expressions, while Python's is pretty robust. You can wrap a C string in a PyObject and send it to a Python parsing function. When the Python code is done manipulating the PyObject, you have the result back safe and sound in C.

Writing callback functions is pretty easy.

static PyObject *callback = NULL;

static PyObject *
set_callback(PyObject *dummy, PyObject *args)
{
    PyObject *result = NULL;
    PyObject *temp;

    if (PyArg_ParseTuple(args, "O:set_callback", &temp))
    {
        if (!PyCallable_Check(temp))
        {
            PyErr_SetString(PyExc_TypeError, "parameter must be callable");
            return NULL;
        }
        // Reference to new callback
        Py_XINCREF(temp);
        // Dispose of previous callback
        Py_XDECREF(callback);
        // Remember new callback
        callback = temp;
        // Boilerplate to return "None"
        Py_INCREF(Py_None);
        result = Py_None;
    }
    return result;
}

This function takes a dummy object and some arguments. It stores a callable function into a global PyObject named callback. You can later use this global PyObject with the callback function stored in it like this:

result = PyEval_CallObject(callback, arglist);

Result is a PyObject (pointer) with the result of the callback function stored in it. Here's a pretty simple example:

def add_ftn(a, b):
    return a + b

parsedates.set_callback(add_ftn)

We set the callback PyObject to store the callable Python function add_ftn. We can test the callback from C by calling PyEval_CallObject(callback, arglist). This C function takes a tuple of arguments (for add_ftn, we need two) and sends them to the callback function stored in the global PyObject variable callback.
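The same flow can be mimicked in pure Python, which makes the moving parts easier to see. This is only an illustrative stand-in for the C code: the module-level _callback plays the role of the C global, and call_callback is a hypothetical name for the step that PyEval_CallObject performs.

```python
_callback = None  # plays the role of the C global `callback`


def set_callback(func):
    """Remember `func`, rejecting anything that is not callable."""
    global _callback
    if not callable(func):
        raise TypeError("parameter must be callable")
    _callback = func


def call_callback(*args):
    """Stand-in for PyEval_CallObject(callback, arglist)."""
    if _callback is None:
        raise RuntimeError("no callback has been set")
    return _callback(*args)


def add_ftn(a, b):
    return a + b


set_callback(add_ftn)
result = call_callback(2, 3)
```

Note that the Python version needs no reference counting; the Py_XINCREF and Py_XDECREF calls in the C version exist only because C has to keep the stored function alive by hand.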

Thursday, June 18, 2009

Enthought and Other Developments

In the words of my Mentor, Pierre,

Enthought is a private company based in Austin, TX, founded by Eric Jones, a long-time Pythonista. Enthought's main source of revenue is the programming of specific scientific application.

Enthought recently had a client request a datetime type exactly like the type I've been working on. Enthought is a prominent contributor to NumPy. Travis Oliphant will be working on the new datetime dtype himself. This is the guy who literally wrote the book on NumPy. And I get the privilege of assisting him.

This is really a godsend, since my knowledge of low level NumPy is quite sparse. This is the nature of open source, it would seem: collaboration between the ignorant and learning (me) and the experienced brilliance which created the foundations and core of these massive projects.

I've been commissioned to write two sets of code. The first set is to get and set datetime members of ndarrays. The second will incorporate the mxDateTime Parsing module for string-to-datetime conversions.

More on those later.

Thursday, June 11, 2009

A Sparse Documentation

A bit of a frustrating last couple of days. I've been all over the place with my research, which means I kept getting lost and confused. First I tried to figure out how to incorporate a scalar datatype into a NumPy dtype. Then I got lost trying to learn some more advanced Python C-API on object handling. I couldn't figure out quite how to make my datetime object play nice with frequencies. So, off I went venturing into the land of the NumPy C-API.

The documentation is very well written, but not exactly geared towards my project. I read, "The best way to truly understand the C-API is to read the source code." So I ran right to the source. The last few days have been a marathon of running over code and trying to understand how everything fits together.

Communication is key. Yeah, we say that for relationships and other insignificant things, but I'm talking about source code. The NumPy source code has a significant amount of C code, which can be overwhelming at times. I've decided the only way to understand what's going on is to document it for myself. I've been literally running through each significant file and using pen and paper to pin down exactly what's going on. I'm interested in anything related to the Array Scalar Types, specifically the LongLong type. Both the datetime and timedelta types will be very similar to the LongLong type. There are key differences, which I guess I should talk about. But I'll save that for tomorrow. More on the NumPy source code:

There are some fancy repetition techniques employed in the definitions of these scalar types. In a comment above each "generic" method is a comma-separated list of scalar types. The "generic" methods have a variable (macro?) that is replaced by each name in the list, so the method is repeated once per scalar type. I know how it works, but I don't know why.
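The repetition trick can be imitated with a few lines of Python. This is a toy sketch and not NumPy's actual preprocessor: the @name@ marker syntax and the function expand_repeat are assumptions made for illustration. The likely motivation is simply to avoid writing near-identical C by hand for every scalar type.

```python
def expand_repeat(template, **substitutions):
    """Emit one copy of `template` per value, replacing @key@ markers."""
    lengths = set(len(values) for values in substitutions.values())
    if len(lengths) != 1:
        raise ValueError("every substitution list must have the same length")
    copies = []
    for i in range(lengths.pop()):
        text = template
        for key, values in substitutions.items():
            text = text.replace("@%s@" % key, values[i])
        copies.append(text)
    return "".join(copies)


TEMPLATE = "static PyObject *@name@_repr(PyObject *self);\n"
generated = expand_repeat(TEMPLATE, name=["byte", "short", "longlong"])
```

Here generated holds three declarations, one each for byte, short, and longlong, all produced from a single template line.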

There's been a very important update on my project. More on that later this week.

Monday, June 8, 2009

Scalar Objects VS dtype

Just a clarification:

When I pull an element out of an ndarray with dtype (for example) float64, the stored variable is a scalar object whose type is float64's .type attribute.

>>> somearray = numpy.array([1,2,3,4,5], dtype='float64')
>>> somefloat = somearray[0]
>>> type(somefloat)
<type 'numpy.float64'>

So numpy.float64 is a scalar type. The dtype is only there to tell the numpy array how to handle the data in the array.
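The idea that a dtype only says how to handle the underlying data can be sketched at the byte level with the standard library's struct module. This is an analogy rather than NumPy code: the format strings stand in for dtypes, and the same eight bytes are read three different ways.

```python
import struct

# Eight raw bytes: the IEEE 754 encoding of 1.0 in little-endian order.
raw = struct.pack("<d", 1.0)

as_float64 = struct.unpack("<d", raw)[0]  # read the bytes as one 64-bit float
as_int64 = struct.unpack("<q", raw)[0]    # read the same bytes as one 64-bit integer
as_int32 = struct.unpack("<2i", raw)      # or as two 32-bit integers
```

The bytes never change; only the interpretation does. A datetime64 dtype would be one more interpretation of a 64-bit integer, this time as a count of time units.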

dtypes

A NumPy dtype is not a normal Python object. Dtypes tell the ndarray how to interpret the array. A dtype is a way to specify exactly what every member in the ndarray is.

>>> numpy.array([1,2,3,4,5], dtype='int32')
array([1, 2, 3, 4, 5], dtype=int32)

The NumPy array is able to take a list of data [1,2,3,4,5] and a dtype to refer to that data (dtype='int32'). When I create this ndarray, the dtype='int32' makes the list of data be interpreted as a list of 32-bit integers. See what happens when I change the dtype to a float:

>>> numpy.array([1,2,3,4,5], dtype='float64')
array([ 1., 2., 3., 4., 5.])

The data inside of the ndarray is now interpreted as an array of 64-bit floats. My goal is to make a new one of these dtypes.

>>> numpy.array(["12-3-2009"], dtype='datetime64[D]')
array([12-3-2009])

I've been working on creating some kind of separate module with datetime64 as a scalar object type. This is not the project goal. I need to create the ndarray dtypes datetime64 and timedelta64. I've been sifting through NumPy's core code all weekend and can't seem to find the file(s) where dtypes are referred to. My current plan is to copy and paste the chunk of code for an already-created dtype so I can start with something barebones and work my way up.

Wednesday, June 3, 2009

Parsing

Yuck. I've hit a wall and it hurts.


if (! PyArg_ParseTupleAndKeywords(args, kwds, "OO", kwlist, &obj_time, &obj_freq))

This line of code should take the args sent to the function, use whatever kwds were supplied to identify those arguments, and parse them as two PyObjects. The kwlist is used to identify which kwds refer to which recipients of the PyObject variables.


 static char* kwlist[] = {"time", "freq", NULL};

Let's give it this input.


>>> print d.datetime64(time='1', freq='1')

We've created a datetime64 object and (allegedly) given it values 1 for both time and freq. This should parse so that we create two PyObjects with values '1' and the different kwds ("time" and "freq"). I should be able to extract those values by simply checking the appropriate PyObjects for their respective keywords. But, alas, if life were easy, it would be boring.


 if (PyObject_HasAttrString(obj_time, "time"))

This resolves to false. Why? This is the exact same implementation as the TimeSeries Date type. I mean, I almost copied this. I don't understand why the arguments are being completely discarded. I can very easily parse to something else, like longs or ints, but wouldn't that just be a workaround? Maybe not...
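One possible reading of the false result, sketched in Python. PyObject_HasAttrString(o, name) behaves like hasattr(o, name), and keyword parsing only decides which variable each value lands in; it does not attach the keyword name to the value. The function below is a hypothetical Python-level stand-in for the C constructor, not the actual datetime64 code.

```python
def datetime64(time=None, freq=None):
    """Python-level stand-in for a constructor taking `time` and `freq` keywords."""
    # The keyword names decide which variable each value is bound to...
    obj_time, obj_freq = time, freq
    # ...but the bound objects are plain strings with no attribute named after the keyword.
    return obj_time, obj_freq, hasattr(obj_time, "time"), hasattr(obj_freq, "freq")


parsed = datetime64(time="1", freq="1")
```

Under this reading the arguments are not discarded at all: obj_time and obj_freq hold the two strings, and the attribute check is false simply because a string has no attribute called "time".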

I'll be trying to parse:


if (! PyArg_ParseTupleAndKeywords(args, kwds, "Li", kwlist, &time, &freq))

Where do the keywords go, now? I assumed they were placed into some kind of magical PyObject slot. But since I'm parsing to a long long int and an int ("Li", in the same order as kwlist), I wonder what happens to the keywords?

Yes, I've been neglecting writing my Unit Tests, I know. But this is just so darned frustrating. Whether or not the Unit Tests even exist, I need to be able to create datetime64 types with appropriate values. I need to be comfortable making datetime64 types with different values before doing anything fancy.

This is important, I promise. Now, off to parse.

Monday, June 1, 2009

datetime64 Objects

I love having working code. Once I can get something to properly work at the most basic level, I slowly modify it from the ground up. The Python C API does not make this easy. In order to create even the most basic Python Object from a C module, you need to be acquainted with a host of obscure and often arcane code segments. I'll try to piece together the basic datetime64 object here.

First, as always, include the Python C API:

#include <Python.h>

I'm a little confused about this line, but I think it's a forward declaration: it tells the compiler that "datetime64Type" is a PyTypeObject that gets defined further down.

staticforward PyTypeObject datetime64Type;

The following is the actual datetime64 object itself. We have a simple struct filled with the PyObject_HEAD (a macro that puts in the object's reference count and a pointer to its type, I think), freq (to tell us the frequency that time refers to), and time (the number of freq units since the epoch, given that I use Unix Time).

typedef struct
{
    PyObject_HEAD       // macro used for refcount & pointer
    int freq;           // frequency of date_value
    long long time;     // 64 bit time since epoch
} datetime64;
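For poking at this layout from Python, the two payload fields can be mirrored with ctypes. This is only a sketch: it leaves out PyObject_HEAD entirely, the class name DateTime64Fields is hypothetical, and the values 9 and 10 are arbitrary.

```python
import ctypes


class DateTime64Fields(ctypes.Structure):
    """The two payload fields of the struct, without PyObject_HEAD."""
    _fields_ = [
        ("freq", ctypes.c_int),       # frequency of the stored value
        ("time", ctypes.c_longlong),  # count of `freq` units since the epoch
    ]


value = DateTime64Fields(freq=9, time=10)
time_bits = ctypes.sizeof(ctypes.c_longlong) * 8
```

On common platforms time_bits comes out as 64, which is where the "64" in datetime64 comes from.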

We need to deallocate datetime64 objects. Later, we tell Python to use this function for just that.

static void
datetime64_dealloc(PyObject* self)
{
    PyObject_Del(self);
}

Here we give Python a bunch of information about the datetime64 Object Types. We tell it the size of the object, the name, what to run when it's deallocated, the documentation, and other (sometimes irrelevant and no longer used) information.

static PyTypeObject datetime64Type = {
    PyObject_HEAD_INIT(NULL)
    0,                          /*ob_size*/
    "datetime64.datetime64",    /*tp_name*/
    sizeof(datetime64),         /*tp_basicsize*/
    0,                          /*tp_itemsize*/
    datetime64_dealloc,         /*tp_dealloc*/
    0,                          /*tp_print*/
    0,                          /*tp_getattr*/
    0,                          /*tp_setattr*/
    0,                          /*tp_compare*/
    0,                          /*tp_repr*/
    0,                          /*tp_as_number*/
    0,                          /*tp_as_sequence*/
    0,                          /*tp_as_mapping*/
    0,                          /*tp_hash */
    0,                          /*tp_call*/
    0,                          /*tp_str*/
    0,                          /*tp_getattro*/
    0,                          /*tp_setattro*/
    0,                          /*tp_as_buffer*/
    Py_TPFLAGS_DEFAULT,         /*tp_flags*/
    "datetime64 objects",       /* tp_doc */
};

The PyMethodDef is an array of every method we can use on objects of datetime64. We put in the name of the method, the function to call, the METH_VARARGS alias, and a description of the method. We could put (for example) "add" as an entry so that we can perform the operation "object.add()". This tells Python where to go when we call each method. The NULL references at the end are a sentinel for Python to know when we're done referencing methods.

PyMethodDef datetime64_methods[] = {
{NULL, NULL, 0, NULL}};

Here's the big, important part of the code. We use the PyMODINIT_FUNC macro to declare our initialization function. Python will run this when the datetime64 module is first imported.

PyMODINIT_FUNC
initdatetime64(void)
{

Here we declare date_object, a PyObject pointer that will later hold the module object.

PyObject *date_object;

Since we're not really doing anything fancy with our datetime64 objects yet, we store PyType_GenericNew, the generic function for making a new Python Object, in our datetime64Type.tp_new slot. Remember, the datetime64Type tells Python what kind of object a datetime64 object is. When we make the tp_new generic, we don't tell it much, but we at least give it a way to create objects of the type.

datetime64Type.tp_new = PyType_GenericNew;

These lines will initialize the datetime64 type and make sure it's legit. We'll return otherwise.

if (PyType_Ready(&datetime64Type) < 0)
    return;

These next lines are possibly the most important call. Py_InitModule3 will create a new module object based on a name and table of functions. We give it "datetime64" to tell it the name, and the datetime64_methods array to tell Python what methods we can run on it.

date_object = Py_InitModule3("datetime64", datetime64_methods,
"DateTime64 module that creates a DateTime64 Object");

Tell Python to increase the reference count for this type.

Py_INCREF(&datetime64Type);

Add the datetime64Type to Python's module dictionary.

PyModule_AddObject(date_object, "datetime64", (PyObject *)&datetime64Type);
}

There you have it. Let's run the build and install (install so I don't have to go looking for the .so file the setup.py file creates) and see if it worked.

>>> import datetime64 as d
>>> day = d.datetime64()
>>> day
<datetime64.datetime64 object at 0x7f20d28350f0>

Looks like a Python Object to me! Since we didn't give Python any methods to run on it, and since the initialization of the object doesn't actually give the object any parameters to set, we can't do a whole lot with it... But hey! We can make them, right? I'll be defining Unit Tests in the next day or so and posting them here, so keep an eye out.
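For comparison, here is roughly what the C initialization function builds, written as a pure-Python sketch. It is only an approximation for illustration: types.ModuleType stands in for Py_InitModule3, setattr stands in for PyModule_AddObject, and the empty class stands in for the C type.

```python
import types


class datetime64(object):
    """datetime64 objects"""


# Roughly what Py_InitModule3 does: make a named module with a docstring.
date_module = types.ModuleType(
    "datetime64", "DateTime64 module that creates a DateTime64 Object")

# Roughly what PyModule_AddObject does: store the type in the module's namespace.
setattr(date_module, "datetime64", datetime64)

day = date_module.datetime64()
```

Just like the C version, day is a bare object with no methods or parameters of its own yet.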

Next up for the day, timedelta64!
