/tech home / projects / personal 'blog / about


ARCHIVE: Storing nested trees in tabular databases

Thu. Jun 16th 2005, 10:22pm:

At my work, we order pretty much all of our hardware from the Dell Outlet. I have an online Dell Outlet account, my boss has one, we have a phone account that has no relation to our online accounts, and we have another online account at the main site. That's four accounts we've used to purchase computers just in the past year! From a customer standpoint, I want Dell to recognize that all of these accounts belong to my company so that next time I try to purchase a piece of hardware, the Dell salesperson realizes that we are a regular customer and offers us the best price & discounts. However, it seems like there's no way they can consolidate these accounts under one primary account.

Had this been 1995 with Dell using four entirely different systems to receive orders, I would've understood, espeically if the phone-in order account relied on some legacy mainframe. However, this is 2005 and the company in question is one of the largest hardware retailers in the world. Sadly, being unable to aggregate multiple accounts into one is not an isolated problem. Blame the tables!

The technology that drives pretty much every popular database in the world does not inherentely support consolidating or aggregating, more commonly known as hierarchical nesting. Instead databases store data in a tabular format with fixed columns under independent rows. Nested and tabular are the two most common structures to organize and store data.

Your home address is nested. Looking at it from bottom up, you have Country > State > City > Street > Name. Nesting allows you to drill down into a category for more information, say the open directory. A normal phone book is tabular, organized alphabetically by the last name for everyone in the same city/locality. The yellow pages are a mix between nested and tabular structures because you can drill down to a particular industry like City > Plumbers, but then you have an alphabetical list of every plumber in Saint Petersburg.

One thing to note here is that if all the categories in a nested structure are different, it may be more efficient to store it in a table. Since [Country] is different from [State], which is different from [City], [Street], and [Name], there is no reason why the address information for everyone in US couldn't be stored in a simple table with columns like [Name], [Street], [City], [State], [Country]. If I want the list of everyone in Florida, I just query for records WHERE [State]="FL". Nested structure is advantageous (and in fact, needed) when there is no difference between two categories, like the hierarchical way in which employees in a company are structured. President > VPs > Assistant VPs > Senior Managers > Junior Managers > Sales people. Everyone in this list is an employee of the company and from the data standpoint there is really no difference between someone who is a Sales person or an Assistant VP - it's just people. Everyone has a [Name], [Title], [Salary], and [Boss]. If you want to organize a company hierarchy, you need nested trees.

Here's another way to look at the difference between nested and tabular data. Your computer harddrive has a nested file-system, meaning you can make a folder, then save some files into it, add a subfolder with even more subfolders inside it, and store files in all of them. Compare this to an Excel spreadsheet that you use to track your gasoline mileage. You create a fixed number of columns with headers like [Date], [Gallons Filled], [Odometer: Total Miles] and [Avg. MPG]. Everytime you buy gas, you note down the information under each of the first three fields and setup some formula so that [Avg. MPG] is calculated automatically. This is a table. It's flat, has a fixed number of columns, and as many rows as you need (or Excel allows - 65536).

While a nested structure works great for storing files on disks it doesn't work that well when you need to search it for keywords, dates etc. Try searching your computer for all the files that were modified between Jun-02-2005, and Jun-05-2005. The search application will begin at the root and search recursively through every folder and its subfolders till it finds files that fall within your date range. Now try to use an online dictionary to search for some word and see how fast it retrieves the meaning. That's because it stores the data in a tabular structure and indexes it so it can jump to or very close to the exact word you are seeking. It does not have to cycle through every word between A and P when you want to know what quintessence means. If you love the whole concept of sorting, searching, and organizing, take a college-level course in Data Structures.

So how do we get the benefits and features of theoretically organizing data under nested trees while still storing it physically in SQL tables? Once again, I'm not the only one who's asked this question. There are two ways: the easy way using full paths and the hard way using digraphs.

Let's get back to the employee hierarchy within a company. The easy way means that for every employee, also store the full path. If I decide to work for Microsoft, my emploee record could be [Name] = "Chirag", [Path] = "Bill Gates > Some Rich Dude > Some Old Guy > Some Other Guy > Some Lady > Some Young Guy". Then if you want to search for all the employees who work under Some Other Guy, you just query WHERE "Some Other Guy" IN [Path]. My name will show up as well as "Some Young Guy". This really is an easy way to store something as complex as nested trees and still be able to retrieve it quickly. The problem happens when there's a corporate shakeup. What happens when "Some Rich Guy" decides to promote "Some Lady" to the level of "Some Other Guy"? All of a sudden, everyone's paths break. What happens when the entire department under "Some Other Guy" gets transferred to "Some New Guy" who is now working under "Some Lady"? The last thing you want to do is re-enter (manually or automatically) the entire path information for every employee affected by this.

The hard way means for every employee, store the span of the employee's subordinates in two columns, the first column being the employee's number itself, and the second column being the number of the last employee under the selected employee. Read the details if you want a good explanation on how this is implemented. The benefits of this method are that it is still just as easy to query for employees under a given person, with the added benefit that if there is a change in the tree structure, you have to recalculate just a few employees' spans. While recalculating this is more complex than just entering the full path, it is definitely more efficient. Especially if you want to store trees as complex as the entire corporate hierarchy of a company like Microsoft. Or who knows, the expanse that is the US Government?

While the easy and the hard way above work on all databases, there is a way that is easier than both of these but is supported by only a few databases. Oracle has a nice CONNECT BY clause in its SQL command set that makes retrieving nested data a breeze. Next to every employee and their own employee number, just store their boss' employee number. Then if you want the list of all the employees under a given employee, use the CONNECT BY clause in your SQL query appropriately and the database will return the desired recordset. However, when it comes to performance, the hard way above certainly beats this one since Oracle still has to make recursive queries internally.

One of the reasons I want my company's database systems to have support for nested trees is because I am sick of maintaining fixed number of access levels for various groups. Part of my job is to create systems (mostly web-based) that cater to each of these: Business Partners > Distributors > Sub-Distributors > Retail Stores > Owners > Sales Persons > Retail Customers > Anonymous Visitors. When someone logs in to their account, they must see different aspects of the same website and it gets really messy when I have to show them only specific sections of each page. While an Anonymous Visitor may see the retail price of a product, only a Retail Customer who has created an account should be able to see the product features and high-resolution pictures. The Owner of a Retail Store should be able to track the sales made by each Sales Person but the Distributor should only see the aggregate sales for each Retail Store, even though the sales order record contains the unique ID of the Sales Person and not the Retail Store. The sales must bubble up from the Sales Person to their Owner to their Retail Store. Trust me, it gets even more confusing when you have to calculate commissions and discounts based on a thousand other rules.

As long as it's good for business, I'm all for introducing complexities into my systems. It IS good for Dell's business to make it possible for customers to combine thier accounts into one. That way, they can give me more discounts and I will have more reason to buy Dell. Similarly, I've felt the need to group a few Retail Stores into Retail-Chains because that's what they are. And the Retail-Chains have Owners and Managers who oversee the sales for each Manager and Retail Store respectively. I find it next to impossible to insert this group in the middle of my existing levels without a major code overhaul. While I hate hard-coding business rules into my programs, unless you have a nested database structure, you pretty much have to specify hard-and-fast rules on what to do when Level 0, 1, 2, 3, and 4 log in. Now I need Levels 2.33 and 2.66 but it's not something I can do on the fly. Of course, even with a nested structure, I'd still have to change my web page code to show certain things based on the user's level, but at least the reports wouldn't be so hard to generate.

While everyone's talking about OODMBS and it's advantages, I've yet to see any major OODBMS system take off into the mainstream. Unless you are in a business that requires the latest in technology and unique innovations, it's better to stick to the mainstream and use the popular products, most of which are free and have numerous tools available for setup and maintenance. I would love to use a nice object-oriented or somehow natively-nested database as long as it's cheap, provides me with enough tools to query from web-based applications/frameworks, and has a moderately-sized userbase with some decent community support. Oh and it can't crash all the time or be too slow. And obviously it should export to the beloved XML format with little or no effort :) Let me know when you find it!