Justin Gordon and Programming

disqus · September 28, 2014, 8:10pm

My biggest contribution to the product was leading and executing a dramatic change in the core storage architecture. Originally, the data in the product was stored in what’s known as an entity–attribute–value (EAV) model. As the Wikipedia article puts it: “The Achilles heel of EAV is the difficulty of working with large volumes of EAV data.” The team bounced around some ideas for solving this problem, such as storing blobs of XML data. Then we hit upon the idea of a hierarchical, indexed binary representation of the data, akin to how a database or file system works. Around this time, I heard about the agile approach from John Seybold, the CTO of Guidewire. John especially espoused the benefits of doing true Test Driven Development, or, as he called it, “Test First Programming”. With an idea and a technique, I took the ball and ran, and the result was a dramatically improved architecture for storing vast quantities of structured, hierarchical data, which we called “serialization”.

So how good is the performance of “serialization”? Pretty damn good. Good enough that its performance and reliability overcame the concerns of many architects within IBM, where the conventional thinking was that a small team could not build a storage architecture able to compete with products like IBM’s DB2 XML. The performance of the “serialization” mechanism for reads is O(N), where N is the depth of the tree. The memory usage is similarly good: only a tiny amount of memory is needed to navigate the tree. Not only can values be read from such a data structure nearly instantly, but writing a value back into the tree is similarly fast. DOM parsing of huge XML documents, with the added cost of re-serializing large DOMs, simply could not compare.
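To make the O(N) claim concrete, here is a minimal sketch of a hierarchical, indexed binary layout. It is an illustration only, not the product's actual format: the byte layout, the `encode` and `read` functions, and the sample document are all hypothetical. Each branch node stores a small index of child keys and offsets, so a read follows one index entry per level and never decodes the rest of the tree.

```python
import struct


def encode(node):
    """Encode a nested dict of strings into one bytes object (hypothetical layout)."""
    if isinstance(node, str):
        data = node.encode("utf-8")
        # Leaf: tag 'L', 4-byte length, then the value bytes.
        return b"L" + struct.pack(">I", len(data)) + data
    keys = sorted(node)
    children = [encode(node[k]) for k in keys]
    index = b""
    offset = 0
    for key, child in zip(keys, children):
        kb = key.encode("utf-8")
        # Index entry: 2-byte key length, key bytes, 4-byte offset of the child,
        # relative to the start of this branch's child data area.
        index += struct.pack(">H", len(kb)) + kb + struct.pack(">I", offset)
        offset += len(child)
    # Branch: tag 'B', 2-byte child count, 4-byte index size, index, child data.
    header = b"B" + struct.pack(">HI", len(keys), len(index))
    return header + index + b"".join(children)


def read(buf, path):
    """Read one value by following a path of keys, one index lookup per level."""
    pos = 0
    for key in path:
        if buf[pos:pos + 1] != b"B":
            raise KeyError(key)
        count, index_len = struct.unpack_from(">HI", buf, pos + 1)
        cursor = pos + 7                 # first index entry
        data_start = cursor + index_len  # where child data begins
        target = key.encode("utf-8")
        for _ in range(count):
            (klen,) = struct.unpack_from(">H", buf, cursor)
            name = buf[cursor + 2:cursor + 2 + klen]
            (off,) = struct.unpack_from(">I", buf, cursor + 2 + klen)
            cursor += 2 + klen + 4
            if name == target:
                pos = data_start + off   # jump straight to the child node
                break
        else:
            raise KeyError(key)
    if buf[pos:pos + 1] != b"L":
        raise KeyError("path does not end at a value")
    (length,) = struct.unpack_from(">I", buf, pos + 1)
    return buf[pos + 5:pos + 5 + length].decode("utf-8")


doc = {"product": {"name": "Widget", "dims": {"w": "10", "h": "20"}}, "sku": "A-1"}
buf = encode(doc)
print(read(buf, ["product", "dims", "h"]))
```

The only state held during a read is the current position in the buffer, which is why memory use stays tiny. This sketch scans each node's index linearly; since the keys are stored sorted, a real implementation could binary-search them instead.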

The use of Test Driven Development was critical for this project. For one, because of the binary manipulations involved, errors in the algorithms turned out to be as obscure as bugs in C/C++ code. Then, the original implementation was slow, which is not surprising in retrospect. A fast-running arsenal of unit tests, however, enabled me to make dramatic changes to the internal algorithms to get the blazing speed needed. It really was like a secret sauce to have used TDD plus pair programming to develop an awesome test suite, and to use YourKit to profile and optimize.

Upon completing the storage mechanism, I realized that searching for data inside these binary objects was the next major problem. This led to my patents “Method and system for data retrieval using a product information search engine” and “Storage and Retrieval of Variable Data”. The first part is a query language, similar to SQL, used for finding data. The second part describes the use of XML database records as shadow copies of the binary storage, which can be written asynchronously to avoid slowing down real-time reading and writing. This way we can have our cake and eat it too, with super fast reading and writing of the structured documents through “serialization” plus the ability to query and export the XML copies of the data.
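The shadow-copy idea can be sketched in a few lines. This is only an assumption-laden illustration, not the patented design: `ShadowedStore`, its in-memory dictionaries, and the `<record>` XML shape are all hypothetical stand-ins for the binary store and the XML database. The point is that a write updates the fast primary store immediately and only queues the XML copy, which a background thread produces later.

```python
import queue
import threading
import xml.etree.ElementTree as ET


class ShadowedStore:
    """Fast primary store with XML shadow copies written in the background."""

    def __init__(self):
        self.primary = {}  # stand-in for the fast binary storage
        self.shadow = {}   # stand-in for queryable XML database records
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, doc_id, fields):
        # The caller only waits for the primary write; the shadow copy is queued.
        self.primary[doc_id] = dict(fields)
        self._pending.put((doc_id, dict(fields)))

    def read(self, doc_id):
        # Reads never touch the XML copies.
        return self.primary[doc_id]

    def _drain(self):
        # Background loop: turn each queued write into an XML record.
        while True:
            doc_id, fields = self._pending.get()
            root = ET.Element("record", id=doc_id)
            for name, value in sorted(fields.items()):
                ET.SubElement(root, name).text = value
            self.shadow[doc_id] = ET.tostring(root, encoding="unicode")
            self._pending.task_done()

    def flush(self):
        # Block until every queued shadow copy has been written.
        self._pending.join()


store = ShadowedStore()
store.write("p1", {"name": "Widget", "color": "red"})
print(store.read("p1"))
store.flush()
print(store.shadow["p1"])
```

The trade-off in a design like this is that the XML copies can briefly lag behind the primary store, so queries against them are eventually consistent rather than immediate.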

This is a companion discussion topic for the original entry at http://www.railsonmaui.com//about/about-justin-gordon-programming.html