2010-11-04

The lineage model for hierarchal data in a SQL database

Recently I decided to make a web app to keep track of a number of policies.  There were a number of different policies, all of which had their provisions numbered and organized in an outline hierarchy that often went three or more layers deep.

The first thing I researched was how to structure SQL databases to handle hierarchical data like a product catalog with the products grouped into categories and subcategories. Or a org chart where employees are grouped into divisions and departments. I quickly learned that most people used either the adjacency list model or the nested set model,  both of which are described in detail in an article on the MySQL website: Managing Hierarchical Data in MySQL.  Another detailed article on both models is Storing Hierarchical Data in a Database.

With the adjacency list model you have a table which gives each record a unique ID number, and then you have a "Parent ID" field where you record the ID number of the record's parent in the hierarchy tree.  If you were doing foods, you would have a record for "apples" and the parent ID for that record would be "fruits."  Then when you need to generate an output of the whole hierarchy tree you do SQL that organizes all the records by connecting the various parent ID values.  Some articles I found on this approach include Tree Drawing with the Adjacency List Model  and Hierarchical SQL.  The thing I didn't like about the adjacency list model was that the SQL required to generate a simple display of the data seemed unduly complicated (lots of recursion) and likely to consume a lot of database server processing power.

I never did really grasp the nested set model, so I won't try and explain it. Needless to say it also involved very complicated SQL that seemed likely to put a big drain on a server.

Then I found an article about a variation on the adjacency list model: More Trees and Hierarchies in SQL.  The author's basic idea was to include a Lineage field in each record where you record the ID numbers of the full path to the record in the hierarchy with each ID number separated by a delimiter character.  For example, if you had a product tree with the iPod in it, the Parent ID field would be "MP3 Players" and the Lineage field would be "Electronics/Audio Equipment/MP3 Players."  This makes the SQL to output the basic tree much easier: you just do a simple SELECT and ORDER BY the Lineage field and you are done.

So here is the structure I ended up with using the lineage field model.  For a hierarchy like this:

  • Plants
    • Fruits
      • Apples
      • Bananas
      • Peaches
    • Vegetables
      • Broccoli
      • Asparagus
  • Animals
    • Poultry
      • Chicken
      • Turkey
    • Meat
      • Beef
      • Pork
The table structure would be like this (except you would use ID numbers instead of the full name of each item as the ID):

Node Lineage
Plants /
Animals /
Fruit /Plants/
Vegetables /Plants/
Poultry /Animals/
Meat /Animals/
Apples /Plants/Fruits/
Bananas /Plants/Fruits/
Peaches /Plants/Fruits/
Chicken /Animals/Poultry/
Turkey /Animals/Poultry/
Beef /Animals/Meat/
Pork /Animals/Meat/
Broccoli /Plants/Vegetables/
Asparagus /Plants/Vegetables/

Once you put your hierarchical data into a structure like this then working with it is easy-peasy. Want to display the whole tree?

SELECT * FROM table_Foods ORDER BY concat(Lineage, Node)

This will output the data organized by groups and subgroups. Want to see just the items in the group Plants, organized by subgroups?

SELECT * FROM Foods WHERE concat( Lineage, Node ) LIKE '/Plants%' ORDER BY concat( Lineage, Node )

This will return just the Plants, grouped by Fruits and then by Vegetables.

Want to indent each item in your output based on its depth in the hierarchy? In your PHP code just count the number of slashes in the Lineage field using substr_count() for each record and set the indent accordingly.

I know this approach seems too simple but I have been working with it for a while in a web application and I haven't run into any dead ends yet, and figuring out the SQL for various tasks has been very easy.