Why The Technology Matters

There are 230,000,000 small and medium-sized businesses globally. Less than 1% of these companies will ever go public. Instead, these companies will need to access private capital markets both to finance growth and to exit. These private capital markets are mysterious, relationship-driven, offline, and unmeritocratic. Axial’s search engine and online network help business owners & private company CEOs navigate these capital markets and develop the connections they need to successfully finance or sell their companies.

This isn’t a problem if you’re Exxon, Apple, Google, Procter & Gamble, or 3M; for companies like those, accessing efficient capital markets is easy. The world’s best investment banks, global law firms and Big Four accounting firms are beating a path to your door to advise you and help you do it. And if you’re a young tech startup, the path has also become fairly straightforward (albeit not easy). You can visit the right VC suspects in Silicon Valley, NYC or Boston and you’ll get a good sense of the user or revenue traction you’ll need to access their pools of venture capital.

But what about “normal” entrepreneurs running one of these 228,000,000 companies? The family-owned $27M revenue wood pellet manufacturer in Nebraska? The owner-operated $4M revenue surgical supplies distributor in Georgia founded in the 1970s? How do these entrepreneurs successfully access capital? Who helps them find the right capital partners? Who helps them find their M&A lawyer? Who helps them identify and select the right investment banker to advise them on key deal terms and valuation, and drive towards a successful outcome? The answer is Axial. We build the network that connects business owners to the capital markets.

The Axial Search Quality Challenge

Just like Google, LinkedIn, Yelp, Amazon, or Zillow, we’re constantly trying to solve for search quality. Search quality for the Entrepreneurial Economy has three foundational components:

1. Comprehensive and accurate intent data.

The perfect search engine for the entrepreneurial economy would have investment criteria and risk profiles for every capital provider. It would know the perfect client profile for every professional advisor. It would know what each entrepreneur needed to take their next step, right now.

2. Exhaustive profile and historical behavioral data.

The perfect search engine for the entrepreneurial economy would have a complete understanding of the historical financial performance of every company, would know how they’ve grown in the past, would know who’s on the board and the leadership team, and how each of those key players had performed in their current and previous roles. It would know every investor at each fund, where they were before, and all the investments they’d made in their career. It would know the industries and transaction sizes a professional advisor has worked on.

3. A Great Algorithm.

The perfect search engine for the entrepreneurial economy would take intent, augmented by historical behavior and performance, and recommend the right organizations and professionals for any entrepreneur running any kind of company, and go on to tell you the best way to connect with them.

The Importance of Trust

Where it gets even more technically challenging is when you overlay the personal relationship element of all of this. The transactions that are getting done on Axial are confidential and sensitive. It’s someone deciding to sell their company or obtain additional capital. These decisions aren’t taken lightly or shared broadly. Ensuring that these entrepreneurs are successful isn’t only about creating highly relevant search results, it’s about helping them figure out the best path to connect with people through trusted mediums. It’s about quality, and about avoiding spam and noise. It’s about making sure people are responsive to one another.

So in addition to building a great search engine, we focus on building a well-behaved network of professionals who engage on the right opportunities, observe best practices, and are incentivized through the design of our software to do so. That means sophisticated implicit and explicit reputation systems, and a product that makes it really easy to do things the right way and hard to do things the wrong way.

Welcome to Axial. We focus on solving one of the world’s most important problems — connecting entrepreneurs to the capital markets at massive scale.

[video] Fast + Predictable

After I spoke about “Predictability at Axial” in January, the crew over at the NYC Startup CTO Summit asked me to give an updated and abridged presentation earlier this month on how we build software here at Axial, as part of the process and infrastructure topic and panel alongside experts from GitHub and InfluxDB.

Axial Lyceum: A Cognitive Psychologist’s Approach to Data Mining

Join Maggie Xiong on 4/22 to learn about how our brains work, and why an understanding of memory and cognition is vital when writing predictive algorithms.

Where: Axial HQ
When: Tuesday, April 22nd, 6PM
RSVP: via Meetup

The first time I heard about Maggie was through a few shared co-workers who referred to her only as “Dr. X”. Of all the stories I heard about “Dr. X” — and there were many — one always stood out: her solo effort in the Netflix Prize competition. I was lucky enough to meet Maggie for the first time a few months later, and she graciously took a little time to explain to me how her efforts on the Netflix competition emerged from her Ph.D. work in cognitive psychology, in particular from her understanding of the inner workings of memory and cognition.

A few years after I first heard that tale of “Dr. X” and the Netflix Prize, Maggie will be stopping by the Lyceum to do a much deeper dive into her approach, methodology and results from her effort to beat Cinematch. Brace yourself for a heaping helping of statistics, neuroscience and – fair warning – Perl.

The Gist

There’s more to data mining than scikit-learn or mahout. Why does collaborative filtering work? Why does the ensemble approach work? Come to think of it, how do we think?

This very ambitious(!) talk will try to ground predictive analyses of consumer behavior in what cognitive scientists have learned about human memory and cognition, and demonstrate through the Netflix Prize project that the prediction of human behavior can benefit greatly from that understanding.

More About Maggie

Maggie leads the data engineering and machine learning teams at the Huffington Post Media Group, and has a Ph.D. in cognitive psychology from Vanderbilt University. Prior to HuffPost, Maggie led the algorithm team at Shutterstock, where she created and optimized image search ranking algorithms.

unicode ^ str

Perhaps the nicest thing you could say about Python 2’s attempt at unicode and str interoperability through implicit coercion is that it forces programmers to come to terms with the difference between unicode code-point strings and unicode character set encoded byte strings. Take the following example:

>>> u'中国'.decode('utf8')
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

As decode exists to turn encoded byte str objects to code-point unicode objects, calling decode on a unicode object should ostensibly be a noop. In reality what happens is that Python coerces our unicode object to a string (using encode), opting to ignore the passed decode codec (utf8) in favor of the default codec (ascii), raising a UnicodeEncodeError and causing a lot of confusion in the process. The good(?) news is that had Python been able to encode with ascii, it would have decoded using utf8.
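To make the coercion concrete, here is roughly the explicit equivalent of what Python 2 is doing under the hood (a sketch, assuming a stock interpreter whose default encoding is ascii):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> # what u'中国'.decode('utf8') effectively attempts first:
>>> u'中国'.encode(sys.getdefaultencoding()).decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)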

Just the Beginning

You might be asking yourself: Why would I ever call decode on a unicode object in the first place?

>>> def an_innocuous_method(args, delimiter=u'_'):
...     '''Any unicode object will cause join to return a unicode object,
...        implicitly `decoding` as needed.'''
...     return delimiter.join(args)
...
>>> # unicode object works just fine ...
>>> an_innocuous_method([u'中国', ''])
u'中国_'
>>> # str objects fail with a UnicodeDecodeError
>>> an_innocuous_method(['中国', ''])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in an_innocuous_method
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

At this point we’ve learned the dangers of mixing unicode objects and str objects in Python, and that to solve the problem we need to use str objects or unicode objects exclusively. In attempting to follow through on this strategy you might be tempted to decode all of your arguments, which would fix the UnicodeDecodeError but bring back our original UnicodeEncodeError. A safe implementation looks more like this:

>>> def an_innocuous_method(args, delimiter=u'_', encoding='utf8'):
...     '''Any unicode object will cause join to return a unicode object,
...        implicitly `decoding` as needed.'''
...     return delimiter.join([i.decode(encoding) if isinstance(i, str) else i for i in args])
>>> an_innocuous_method(['中国', ''])
u'中国_'

But the combination of type-checking and decoding on anything that could possibly be a str is time-consuming for you, ugly and inefficient for your code, and prone to PEBKAC errors which are difficult to catch.

Make it Easy to Do the Right Way

If you choose to go unicode for everything, you want to start by converting all of your string literals to unicode literals:

# in each module
>>> from __future__ import unicode_literals
>>> foo = ''
>>> foo
u''

You then want to be able to safely decode any str objects passed to your methods into unicode objects (or vice-versa), for which a unicodify (or stringify) decorator works great:

@unicodify(encoding='utf8')
def a_truly_innocuous_method(args, delimiter='_'):
    return delimiter.join(args)

For which the code would look something like:

'''Decorators to convert all arguments passed to a function or method to
   unicode or str, including default arguments'''
import sys
import functools
import inspect

def _convert_arg(arg, from_, conv, enc):
    '''Safely convert unicode to string or string to unicode'''
    return getattr(arg, conv)(enc) if isinstance(arg, from_) else arg

def _wrap_convert(from_type, fn, encoding=None):
    '''Decorate a function converting all str arguments to unicode or
       vice-versa'''
    conv = 'decode' if from_type is str else 'encode'
    encoding = encoding or sys.getdefaultencoding()

    # override string defaults using partial
    aspec, dflts = inspect.getargspec(fn), {}
    if aspec.defaults:
        for k, v in zip(aspec.args[-len(aspec.defaults):], aspec.defaults):
            dflts[k] = _convert_arg(v, from_type, conv, encoding)
        fn = functools.partial(fn, **dflts)

    @functools.wraps(fn.func if isinstance(fn, functools.partial) else fn)
    def converted(*args, **kwargs):
        args = [_convert_arg(a, from_type, conv, encoding) for a in args]
        for k,v in kwargs.iteritems():
            kwargs[k] = _convert_arg(v, from_type, conv, encoding)
        return fn(*args, **kwargs)

    return converted

def unicodify(fn=None, encoding=None):
    '''Convert all str arguments to unicode'''
    if fn is None:
        return functools.partial(unicodify, encoding=encoding)
    return _wrap_convert(str, fn, encoding=encoding)

def stringify(fn=None, encoding=None):
    '''Convert all unicode arguments to str'''
    if fn is None:
        return functools.partial(stringify, encoding=encoding)
    return _wrap_convert(unicode, fn, encoding=encoding)

__all__ = ['unicodify', 'stringify']
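Here is a minimal interactive sketch of the decorator in action, assuming Python 2.7 (greet is a hypothetical example function):

>>> @unicodify(encoding='utf8')
... def greet(name, greeting='hello'):
...     '''Both the str default and any str arguments are decoded to unicode.'''
...     return greeting + u' ' + name
...
>>> greet('中国')
u'hello \u4e2d\u56fd'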

View Gist on Github

Kill Yr Features

Kim Gordon of Sonic Youth, walking on her bass, 1991. (Source: Wikipedia)

Sonic Youth fans may have missed one of the band’s earliest gems, “Kill Yr Idols”. The lyrics are simple: don’t worry about impressing the critics, find out the new goal. For writers, the phrase “murder your darlings” from English author Sir Arthur Quiller-Couch applies in the same way: remain objective and strive for improvement regardless of sentiment or precedent. In technology, there is no shortage of visionaries who’ve reinforced the importance of pushing ideas forward and starting over to get things just right. Every now and then you’re presented with the opportunity (or the necessity) to take something born in a lot of blood, sweat and cursing, tear it down to the foundation, and start again.

Rebuilding pieces of your application to fit new needs is nothing new. It should be core to any team to want to build things better and more scalable. It’s an opportunity to continually re-evaluate the UX and product decisions you made before and improve upon them. Here’s a perfect example of why anyone involved in planning or developing products should embrace that opportunity, every time.

1 idea, 3 wizards, too many design changes, too many bugs

It started about half a year ago as a grand plan for a single wizard style and function to help our users create and edit their company profile, investment and deal criteria. We designed it and built it from scratch with a new UI, validation engine, and branching logic to help guide users through the process. We covered all three types of wizards in the design, but we started small (or so we thought) by focusing on the first of the three wizards. We tried to keep scope manageable by iterating, and chose to maintain some of the original (older and rigid) architecture. As the first wizard began taking longer than we thought it would, we cut scope, and the end result, while well received, was sort of anemic, and we were still supporting 2 workflows. We immediately started on the next wizard on an even more complex and key piece of the application. What started as identical twins became fraternal twins, and then became bad doppelgängers. Each implementation had its own requirements and circumstances, and the result of our first two attempts was the only thing worse than two specific implementations … two competing generic implementations, neither of which was generic. By the beginning of this year, we were in the same position we started in: separate wizards, different methods of validation, with conflicting and replicated stylesheets. We needed to take a step back.

Time to Pull The Plug

Despite our best intentions, nothing we had built before was truly re-usable without an overly complex re-configuration. Rather than go down the same path (again) for the next round of improvements needed, we took a step back across the board. We used the two existing implementations and the third proposed one to consolidate on a more consistent and clear UX, graphic design and wizard architecture that would allow us to be generic in the database and specific in the experience. Even though it meant revisiting everything we had just built, we went back to the functionality we wanted all along and started from scratch, this time explicitly covering each of the three wizards we’d made before. Every single question, field and option was re-evaluated and compared. Similar functionality was combined, refined, and styled. Unique solutions we’d come up with while sweating each launch in isolation were rethought and some were ultimately scrapped completely. We totally rethought how we wanted to handle validation, in a way we could actually reuse, and in other places outside of the wizards too.

Damn the Roadmap, Full Speed Ahead!

Two weeks later, we had a master design for every element in every wizard. We had a working stylesheet and structure that covered every use-case, and we had every field in each existing and new wizard planned and accounted for. Despite a full roadmap, we made the case to revisit and rebuild, rather than just quadrupling down on our previous work and further bifurcating the experience and code along the way, and we decided to spend the extra time to rebuild it all using what we’d learned. We tackled the simplest wizard first, and quietly shipped a reboot without anyone really noticing. The newest and most complex wizard got all of the new functionality that the roadmap had called for, and along with it, replaced the bad decisions we’d made before, all in about half the time it took to re-build the first wizard. The third wizard is planned to take even less time, putting the whole effort at about 6 weeks of development. We’ll have redesigned and rebuilt in 2 months what originally took 6 to build, and resolved numerous headaches along the way. Plus, our original goal of using the same logic and styling for all wizards has truly been realized.

The process has also gotten us thinking about other parts of the application that have suffered the same fate: development and design without enough focus on reusability. These pieces could, and more importantly, should be much more closely related. Thousands of lines of one-off code and design inconsistency and conflict could be eliminated. As a team, this experience has improved the communication between product and engineering, and encouraged us to think about reusability and application implications in a more specific, defined, and calculated way. Like listening to “Kill Yr Idols,” a harsh experience leads to serenity and desire for self-improvement. What better reason to take the machete to an idol of your past?

Default to Tuple


Much has been written on why you shouldn’t default to mutable values when defining functions in Python, often leading to code that looks like this:

def iter_dflt_none(iter_=None):
    '''Default to None is one common pattern'''
    iter_ = iter_ or []
    return iter_

null = object()
def iter_dflt_sentinel(iter_=null):
    '''Default to null sentinel to allow None to be passed as a valid value'''
    iter_ = [] if iter_ is null else iter_
    return iter_

While there is nothing wrong with either of these battle-tested patterns, both involve highly repetitive statements at the top of each defaulted function to convert the non-iterable default value back to an empty iterable (typically an empty list), when we could avoid this code entirely by defaulting to list’s immutable cousin, tuple:

def iter_dflt_tuple(iter_=tuple()):
    '''Just default tuple, and it's already iterable!'''
    return iter_
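For completeness, here is a quick sketch of the classic gotcha that all of these patterns are working around: a mutable default is created once, at function definition time, and shared by every call that relies on it.

>>> def iter_dflt_list(iter_=[]):
...     '''The default list is shared across calls that omit iter_'''
...     iter_.append(1)
...     return iter_
...
>>> iter_dflt_list()
[1]
>>> iter_dflt_list()  # same list object as before
[1, 1]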

Axial Lyceum: On Punch the Monkey — A deep dive into online ad auctions and systems

Join Sandeep Jain on 3/25 to learn the ins and outs of online ad auctions.

Technical Advisor @axialcorps

Where: Axial HQ
When: Tuesday, March 25th, 6PM
RSVP: via Meetup

Ever wonder why you keep getting ads for Budweiser when you’re clearly a Coors aficionado? In this Lyceum, Sandeep will dive into the algorithms and systems that decide how ads are delivered on the internet. Never again will you wonder why you’re being hawked bad beer. This Lyceum will mix behavioral economics, game theory, distributed systems, and graph theory into one fun and informative talk.

More About Sandeep

Sandeep is currently a technical adviser to Axial. Before that, he co-founded Reschedge, a SaaS enterprise recruiting tool that was recently sold to HireVue. He started his career at Google, where he spent 5 years working on Google Maps and DoubleClick products, finishing his time there as the technical lead of the display advertising backend.

GLitter – A Simple WebGL Presentation Framework

Example GLitter Page

A couple of weeks ago, I gave an Axial Lyceum talk on WebGL, a technology that allows hardware-accelerated 3D graphics to be implemented in the browser using pure javascript code. Before I came to Axial, I was doing a lot of 3D graphics work, which is how I came to learn WebGL. I had a moment of panic a few days before my talk when I realized it had been almost a year since I’d done any WebGL programming and I was feeling a little rusty. I was about to fire up PowerPoint and start creating what probably would have been a boring presentation when I had a flash of inspiration: I could implement the slides for my WebGL talk directly in WebGL!

This both forced me to get back into the swing of WebGL and resulted in a much more engaging and interactive presentation. I used the fantastic Three.js library to make a simple framework for my presentation. After my talk, a few of the attendees asked if they could get a copy of the code that I used for the presentation, so I spent a little time making the code a bit more modular and it is now available at http://github.com/axialmarket/GLitter/. Before I explain how the presentation framework works, a little background on three.js is necessary.

WebGL is a very powerful technology, but it is also very complicated and has a steep learning curve. Three.js (http://threejs.org) makes creating 3D graphics with WebGL drastically simpler. While you still need some understanding of matrix mathematics and 3D geometry, Three.js abstracts away some of the highly technical details of WebGL programming like writing fragment shaders in GLSL or manipulating OpenGL buffers. In Three.js, you create a scene, a camera and a renderer object, and then use the renderer object to render each frame. To actually render something, you add 3D objects to the scene. These 3D objects have a geometry and materials, a transformation matrix that specifies translation, rotation and scaling relative to the object’s parent, and the ability to contain other 3D objects.

With GLitter, you define a “Page” object for each step in the presentation that creates the 3D objects and behaviors needed to implement that step. The GLitter “Scene” object manages the Three.js scene, camera and renderer, implements transition logic for switching between steps, and provides some common keypress handling functionality. One of the neat things about WebGL is that it renders inside of the HTML5 canvas, so it is easy to composite the WebGL scene with HTML content overlaid on top. In GLitter, there are a few different types of HTML content you can overlay on top of the WebGL canvas. First, each Page provides a title and optionally some subtitle content. Second, GLitter uses the dat.GUI control framework to allow Page objects to easily add controls for properties of javascript objects. Lastly, GLitter provides an “info” interface that can be used to show dynamic content.

To see how this works in practice, let’s create a presentation with two steps. The first will show a spinning cube and provide controls to change the cube’s size, and the second will show a sphere with controls to change the camera’s field-of-view, aspect ratio and near and far clipping planes. On the first step, we will show the cube’s transformation matrix in the “info” overlay and in the second, we will show the camera’s projection matrix.

We first create GLitter Page objects CubePage and SpherePage:

var CubePage = new GLitter.Page({
    title: "Cube",
    subtitle: "Spinning!",
    initializor: function (scene) {
        var context = {};
        var cubeMaterial = new THREE.MeshLambertMaterial({ color: 0xee4444});
        var cubeGeometry = new THREE.BoxGeometry(1, 1, 1);
        context.cube = new THREE.Mesh(cubeGeometry, cubeMaterial);
        context.cube.position.z = -2.5;
        scene.add(context.cube);

        var spin = function() {
            new TWEEN.Tween(context.cube.rotation)
                     .to({y: context.cube.rotation.y - 2*Math.PI}, 2000)
                     .start()
                     .onComplete(spin);
        }
        spin();
        scene.add(new THREE.PointLight(0x999999));
        scene.camera.position.z = 2.5;
        return context;
    },
    finalizor: function() {
        GLitter.hideInfo();
    },
    updator: function (context) {
        return function (scene) {
            GLitter.showInfo(GLitter.matrix2html(context.cube.matrix));
            return ! context.STOP;
        }
    },
    gui: function (scene, context) {
        scene.gui.add(context.cube.scale, 'x', 0.1, 5);
        scene.gui.add(context.cube.scale, 'y', 0.1, 5);
        scene.gui.add(context.cube.scale, 'z', 0.1, 5);
    }
});
var SpherePage = new GLitter.Page({
    title: "Sphere",
    initializor: function (scene) {
        var context = {};
        var sphereMaterial = new THREE.MeshLambertMaterial({ ambient: 0xee4444 });
        var sphereGeometry = new THREE.SphereGeometry(1, 20, 20);
        context.sphere = new THREE.Mesh(sphereGeometry, sphereMaterial);
        context.sphere.position.z = -5;
        scene.add(context.sphere);

        scene.add(new THREE.AmbientLight(0x999999));
        return context;
    },
    finalizor: function() {
        GLitter.hideInfo();
    },
    updator: function (context) {
        return function (scene) {
            GLitter.showInfo(GLitter.matrix2html(scene.camera.projectionMatrix));
            return ! context.STOP;
        }
    },
    gui: function (scene, context) {
        var upm = function(){scene.camera.updateProjectionMatrix()};
        scene.gui.add(scene.camera, 'fov', 1, 179).onChange(upm);
        scene.gui.add(scene.camera, 'aspect', 0.1, 10).onChange(upm);
        scene.gui.add(scene.camera, 'near', 0.1, 10).onChange(upm);
        scene.gui.add(scene.camera, 'far', 0.1, 10).onChange(upm);
    }
});

The “title” and “subtitle” values are pretty self-explanatory. The “initializor” function is called to initialize the page when GLitter transitions to it. It adds the desired 3D objects to the GLitter Scene object, including lights, and returns a context object holding any objects that need to be referenced later. The “finalizor” function is called just before GLitter transitions away from this page. The “updator” function is called just before every frame is rendered. The “gui” function is used to update controls in scene.gui, which is a dat.GUI object.

Note that in CubePage, the spinning of the cube is handled by the TWEEN javascript library. TWEEN requires an update call to be made in every frame, but this is handled automatically by GLitter.

Also note that the “updator” functions return “! context.STOP”. The idea here is that if the “updator” function returns a false value, rendering is paused. The GLitter Scene object intercepts ‘keydown’ events, and will set context.STOP to true if Enter or Space is pressed. In addition, if “n” or “p” is pressed, GLitter transitions to the next or previous step, respectively. Page objects can add handling for other keypresses by defining an ‘onKeydown’ function. If this function returns a true value, then the standard GLitter keypress handling is skipped.

Now that these Page objects are defined, we can create an HTML page that sets up the basic structure GLitter needs and loads all of the required files. Currently GLitter is not a javascript module, so we load all of the files explicitly:

<!DOCTYPE html>
<html>
  <head>
    <title>GLitter Blog Example</title>
    <meta charset="utf-8"> 
    <link rel="stylesheet" type="text/css" href="example/example.css">
    <script src="//rawgithub.com/mrdoob/three.js/master/build/three.js"></script>
    <script src="lib/tween.js"></script>
    <script src="lib/dat.gui.min.js"></script>
    <script src="lib/OrbitControls.js"></script>
    <script src="lib/EffectComposer.js"></script>
    <script src="lib/MaskPass.js"></script>
    <script src="lib/RenderPass.js"></script>
    <script src="lib/ShaderPass.js"></script>
    <script src="lib/CopyShader.js"></script>
    <script src="lib/HorizontalBlurShader.js"></script>
    <script src="lib/VerticalBlurShader.js"></script>
    <script src="GLitter.js"></script>
    <script src="Page.js"></script>
    <script src="Scene.js"></script>
    <script src="example/CubePage.js"></script>
    <script src="example/SpherePage.js"></script>
    <script src="example/example.js"></script>
  </head>
  <body>
    <div id="content" class="content">
        <div id="title" class="title"></div>
        <div id="subtitle" class="subtitle"></div>
    </div>
    <div id="info" class="info">
    </div>
  </body>
</html>

The files in lib/ are third-party libraries, and GLitter itself is comprised of GLitter.js, Page.js and Scene.js. The contents of CubePage.js and SpherePage.js are shown above, so all that’s left is example.js:

window.addEventListener('load',
    function () {
        CubePage.nextPage = SpherePage;
        SpherePage.prevPage = CubePage;
        var scene = new GLitter.Scene({});
        scene.initialize();
        document.getElementById('content').appendChild(scene.domElement);
        scene.loadPage(CubePage);
    }
);

You can see the completed example at: http://axialmarket.github.io/GLitter/example.html

There’s still quite a bit of work to do to make GLitter more powerful and easier to use, but as you can see, it’s already pretty easy to use, and doesn’t look too shabby either!

[Video] Having Fun with WebGL

On 2/25, our very own Ben Holzman stopped by to teach us all a little bit about 3D graphics on the Web, using WebGL. Ben’s presentation was 100% written in a WebGL presentation framework he calls “GLitter”, which he will be releasing shortly.

Out of everything Ben managed to pack into this Lyceum, perhaps the most impressive thing we learned was just how easy it’s becoming to do 3D graphics in the browser with tools like three.js.

Introspection in SQLAlchemy: Reflections on the Magnum Opus


The layering of orthogonal concepts within SQLAlchemy lends itself to deep introspection. These capabilities can be used for a variety of purposes including debugging and concise expression of programmatic intent. The detailed introspection API added in version 0.8 can be very useful in several scenarios. Previously, while these introspection capabilities were available, they were mostly undocumented and without official support. We’ll cover some deeper parts of this API through the investigation of an application bug and the addition of a common feature. First, though, it might be best to glance at the surface.

Transmutation Ingredients

SQLAlchemy is a comprehensive database interface tool that is split into several components. The most obvious distinction is between ‘SQLAlchemy Core’ and the ‘SQLAlchemy ORM’. Both the Core and ORM themselves are greatly subdivided into several layers, though the primary focus in this article is the ORM’s internals. In addition, it’s important to note the separation of the ORM from the declarative extension. The declarative extension adds the declarative base class and other niceties, but ultimately it is just another layer.

Session

A primary focal point of SQLAlchemy is the fabled “db session”. This object is the key to interacting with the ORM (during model usage, rather than creation), since nearly all of the heavy lifting is done in a way that is rooted to a single Session. This Session does several things that are mostly behind the scenes, but all ORM object instances ultimately hold a reference back to it.

The Session object is responsible for storing and synchronizing the in-memory Python object instances with the current state of the database. One important shortcut (normally) taken by SQLAlchemy is to assume that all interaction with the session takes place in the context of a transaction. This allows SQLAlchemy to batch updates, maintain its identity map, and issue queries that return accurate results while only communicating with the database when needed.
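For reference, the kind of session the snippets below assume might be set up like this (a minimal sketch; the engine URL and the sessionmaker options, shown at their defaults, are just illustrative):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# an Engine manages the actual DBAPI connections
engine = create_engine('postgresql://localhost/example')  # hypothetical database URL
# a configured Session factory; autoflush/autocommit shown at their defaults
Session = sessionmaker(bind=engine, autoflush=True, autocommit=False)
# the session that all ORM instances below hold a reference back to
db_session = Session()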

Flushing

In common use of SQLAlchemy, communication with the database is delayed until “needed”. In particular, this means that

inst = db_session.query(MyObj).get(1)
inst.first_attr = "hello"

does not execute an UPDATE statement. In fact, the data for ‘first_attr’ is stored within a “need to flush” attribute, and then sent in an UPDATE statement when a flush occurs. These flushes are either explicit (session.flush()) or automatic (run before each query, including SELECT queries). In addition, a flush is always executed before a commit. Autoflush exists to ensure that changing an object and then querying for it returns the correct result, since before the flush the database is unaware of in-memory modifications. In other words, if one ran the above code and then ran:

db_session.query(MyObj).filter_by(first_attr="hello")

with autoflush=False, it would not be returned, but with autoflush=True, a .flush() call would be executed first, allowing the DB to notice that this object meets the criteria.
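Putting that together, here is a small sketch of the behavior described above (assuming the MyObj model and db_session from earlier, with autoflush left at its default of True):

inst = db_session.query(MyObj).get(1)
inst.first_attr = "hello"            # pending change, no UPDATE issued yet

# this query triggers an autoflush, so the pending UPDATE is sent first
# and the SELECT can match on the new value
found = db_session.query(MyObj).filter_by(first_attr="hello").first()

# thanks to the identity map, the query hands back the same in-memory object
assert found is inst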

InstanceState

Every model instance has an associated InstanceState instance, which is the actual store for these values. In particular, the current (possibly not-yet-flushed) values live in the instance’s __dict__, exposed via InstanceState’s .dict attribute, while the “old” values (if they are loaded) are stashed on .committed_state (a somewhat confusing name, since these are the values as of the last flush or commit). The official API to access this data, however, is via the History interface. This interface shows the old value and new value in a much more convenient way, and is gained via the inspection API.

istate = inspect(inst) returns an InstanceState. istate.attrs returns a “namespace” (a dict-like object) of attribute names mapped to AttributeState instances. These AttributeState instances expose the history attribute, which returns the History object and is the “official” interface to the old and new pre-flush values.
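In code, that chain looks roughly like this (a sketch, reusing the inst object from the flushing example above):

from sqlalchemy import inspect

istate = inspect(inst)               # InstanceState for this instance
astate = istate.attrs.first_attr     # AttributeState for one attribute
hist = astate.history                # History(added, unchanged, deleted)
# hist.added   -> pending, not-yet-flushed values
# hist.deleted -> the previously flushed values they replace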

Alchemical Calcination

In resolving bugs, one must first investigate and determine their cause. In a bug I resolved recently, a logically unchanged object was causing SQLAlchemy to emit an UPDATE clause, which caused the database to update a recently changed timestamp. In this case, an application of inspect(), InstanceState, AttributeState, and History used just before db_session.commit() was very useful in spotting the issue:

>>> dict([(k, v.history) for k, v in inspect(model_instance).attrs.items() if v.history.has_changes()])
{u'location_id': History(added=['2'], unchanged=(), deleted=[2L])}

Given a model instance, we inspect() it, which returns an InstanceState instance. This tells us about the state of the object in its session (pending, detached, etc.), and has details about its attributes. Accessing the attrs attribute returns a “namespace”, which behaves more or less like a dict. Its keys are the names of persisted attributes for our instance, and its values are AttributeState objects. The AttributeState object’s history attribute gives us access to a History object, which records changes that have not yet been persisted. In particular, it is these History objects that contain the details of state that is pending but not yet persisted to the database via a flush operation.

It is worthwhile to note that this history API is generally only useful pre-flush, because it is during flush that an UPDATE or INSERT statement can be issued. That being said, the above could integrate quite nicely with a session before_flush listener (or simple breakpoint).
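As a sketch of that idea, a before_flush listener (using the db_session from earlier) could log every pending change just before it is written:

from sqlalchemy import event, inspect

@event.listens_for(db_session, 'before_flush')
def log_pending_changes(session, flush_context, instances):
    '''Print the per-attribute History of every dirty instance pre-flush.'''
    for obj in session.dirty:
        changes = dict(
            (k, v.history) for k, v in inspect(obj).attrs.items()
            if v.history.has_changes()
        )
        if changes:
            print obj, changes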

Alchemical Multiplication

Serialization is a common function added to many declarative base object implementations. Often it will take the name of .as_dict(), .as_json(), or even .__getstate__() for Base classes that would like to support the pickle protocol. Unfortunately, several implementations fall short of achieving various desired outcomes. For example, one may want to serialize an object to json for display on the frontend. However, as soon as different users have different logical “attribute level” permissions to view fields (e.g., ‘owner’, ‘salary’, or ‘home_address’), this one-size-fits-all approach falls short. In addition, there are several other decisions to make – often an object has dependent children (say, a user has multiple phone numbers). In the json representation, it may be convenient to return the attribute ‘phones’ as a list of numbers rather than deal with an entirely separate UserPhone object on the frontend. In short, there’s no one-size-fits-all solution.

That being said, here’s my one size fits all solution. It inspects an object instance and returns a serialized dict. The function is recursive by default, though that can be disabled. Many to many relationships are followed and returned as dicts or as a list of ids (depending on arguments). In addition, it takes a filter_func that is called twice per dumped object: once with a dict of attributes (before hitting the database) that can whitelist or add additional attributes to return, and then a second time with the loaded attribute values. This allows a clean logical dump with appropriate filtering based on where it’s called.
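The full implementation is longer than fits here, but a stripped-down sketch of the idea, built on the mapper’s column and relationship collections from inspect() (with the filter_func handling simplified to a single call on the attribute names), might look like:

from sqlalchemy import inspect

def dump(obj, include_relationships=False, filter_func=None):
    '''Serialize a mapped instance to a dict via mapper introspection.'''
    mapper = inspect(obj).mapper
    names = [attr.key for attr in mapper.column_attrs]
    if filter_func is not None:
        names = filter_func(names)
    result = dict((name, getattr(obj, name)) for name in names)
    if include_relationships:
        for rel in mapper.relationships:
            value = getattr(obj, rel.key)
            if rel.uselist:
                result[rel.key] = [dump(child) for child in value]
            elif value is not None:
                result[rel.key] = dump(value)
    return result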

>>> dump(model_instance)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b'}

>>> dump(model_instance, include_relationships=True)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b', 'foos': [{'id': 1, 'bar': 123}, {'id': 2, 'bar': 456}]}