What is De-Normalized Data?
Normally we like our data served up in flat-files with rows and columns. Columns store the data we wish to keep segregated for a record, like name, address, city, state, or zip. Rows store the complete set of data as it pertains to one complete record, like a customer or a product.
But what happens when our data won’t fit neatly into this model? Like a list of people attending a conference, or links between product categories. A good example of denormalized data (that’s been normalized) is the breadcrumb trail on a webpage or a website menu system.
For the sake of demonstration, the table on the left has a field labeled "firstname" and a field containing denormalized data labeled "list". The "list" field may hold a list of friends of the person named in the "firstname" field; it could also be a list of attendees at a meeting.
So what is so important about a denormalized data list? Just ask Facebook, eBay, or Amazon: their entire data-mining models are built on denormalized data. Customers have different likes and dislikes; some customers have many hobbies or favorite subjects and others may have very few. By normalizing "denormalized" data we are better able to identify target audiences, behaviors, and, more importantly, "patterns". Yes, can we all say PATTERN! You may have seen a detective story where the savvy investigator discovers a suspect has been seen near the scene of three homicides or robberies. If the suspect was near one of the scenes it may be thought of as a "coincidence"; if he or she was seen near two of the crime scenes…? Definitely not a coincidence; and if he was seen near THREE of the crime scenes: GUILTY! (maybe…). Regardless, a denormalized list, once normalized, can lead us to patterns we would not necessarily spot, or at least not identify as quickly.
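The "normalize, then count" idea above can be sketched in a few lines. This is an illustrative Python sketch, not the project's actual SQL/PHP code; the friend strings and names are hypothetical stand-ins for the "list" field:

```python
from collections import Counter

# Hypothetical denormalized rows: each string is one record's comma-delimited "list" field.
rows = [
    "alice,bob,carol",
    "bob,dave",
    "bob,carol,eve",
]

# Normalize: split each list back into individual values, one per (row, name) pair.
names = [name for row in rows for name in row.split(",")]

# Count appearances across records -- a name that keeps showing up is a pattern,
# not a coincidence (bob appears in all three rows here).
counts = Counter(names)
print(counts.most_common(2))
```

The same split-and-count step is what the detective is doing informally: tallying how often the same value turns up across otherwise unrelated records.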
XFN, FOAF and the Meaning of Life
Contemporary models like XFN (XHTML Friends Network), FOAF (Friend of a Friend), Six Degrees of Separation (Stanley Milgram), and the Three Degrees of Influence rule – written about in "Connected" by Christakis & Fowler (Little, Brown & Company, 2009) – offer an understanding of the confluence patterns of groups of people and their connections. Most definitely a real meat-and-potatoes explanation of the human quest for spirituality, love, and the afterlife.
Back to the Real World
To locate hidden patterns in mountains of denormalized data generated by legacy systems, websites, spreadsheets, financial records, receipts, and hundreds of other data sources, it takes just a few things. First of all, denormalized data would be a good thing to have, but what if you don’t have any? Then you make some! Using a few flat-files and some randomly generated strings here and there, we can piece together enough data to develop software that automates the process of churning through millions of records looking for a few "needles in the haystack."
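Manufacturing that test data can be sketched quickly. This is a hedged Python illustration (the name pool and record shape are hypothetical, standing in for a flat-file of first names and the demo table):

```python
import random

random.seed(1)  # deterministic so the fabricated data is repeatable

# Hypothetical stand-in for a flat-file of first names.
pool = ["alice", "bob", "carol", "dave", "eve", "frank"]

def fake_record(rec_id):
    """Manufacture one denormalized record: a person plus a random friend list."""
    friends = random.sample(pool, random.randint(1, len(pool)))
    return {"id": rec_id, "firstname": random.choice(pool), "list": ",".join(friends)}

# Enough rows to exercise the pattern-mining code before any real data arrives.
records = [fake_record(i) for i in range(1, 101)]
```

Seeding the generator keeps the fake data reproducible between runs, which makes debugging the downstream mining code much easier.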
Development Tools, the Cloud and Other Sundry Items
Open source and semi-open-source solutions are here to stay. The days of being hamstrung by licensing, versioning, and phone support are gone. Mainstream, seasoned products like MySQL, ActionScript, PHP, Flexicious (see Part 4), and Adobe Flex 4.5 (Eclipse) are hard to beat for security, reliability, ease of use, and overall adoption by professional developers.
For this project I am using Flex 4.5 Data Management Services, which I found to be a bit friendlier with my AWS RDS MySQL account than with my HostGator account. Flex has no trouble connecting to AWS via its built-in PHP class builder, and once you get through the Zend nonsense you are off and running.
Joe Celko and Heavy Lifting
The heavy lifting is done directly on the RDS server via MySQL Workbench, a phenomenal tool with advanced features you won’t find in phpMyAdmin or the shell. Unless you’re Joe Celko, who I am sure can will MySQL code into existence using pure thought and remote viewing, I suggest using it.
Workbench’s windowed interface lets you click between tabs for creating scripts, stored procedures, and import/export, and allows simultaneous connections to multiple servers. The import/export tool is fantastic: no more timeouts on data imports like you get with phpMyAdmin.
PHP, Stored Procedures and Flex Data Management Service
For this project I used a normal PHP service class generated by the Flex 4.5 interface, uploaded to the services directory of the HostGator folder where I store the Flex app (this will be better explained in Parts 2 and 3).
I prefer using MySQL stored procedures whenever possible: they are easier to call from PHP, they run faster because they execute on the server, and they are more secure than a PHP file doing the same thing.
Custom PHP functions can easily be added to the Flex-generated PHP class script. You can cut and paste existing class objects and modify them, and they will show up in your Data Services window, which is pretty slick.
In Part 1 of this tutorial I demonstrate how I was able to insert a random list of names from one table into a field of another table. The difficulty I encountered was generating the comma-delimited random string containing from one to ten first names, then inserting each result into a firstnames table to be used by an update query that inserts one of the 500 strings, at random, into the demo data table.
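The workflow just described can be sketched end to end. This is a Python simulation of the SQL logic, not the actual stored-procedure code; the name pool and the demo records are hypothetical:

```python
import random

random.seed(7)

# Hypothetical name pool standing in for the firstnames table.
pool = ["alice", "bob", "carol", "dave", "eve", "frank", "grace", "heidi"]

def random_name_string():
    """One comma-delimited string of one to ten first names (repeats allowed)."""
    return ",".join(random.choice(pool) for _ in range(random.randint(1, 10)))

# Pre-build 500 strings, standing in for the firstnames.list column.
strings = [random_name_string() for _ in range(500)]

# Update step: each demo record receives one of the 500 strings at random.
demo = [{"firstname": name, "list": None} for name in ("pat", "kim", "lee")]
for rec in demo:
    rec["list"] = random.choice(strings)
```

Pre-building the 500 strings first, then assigning them in a second pass, mirrors the two-step insert-then-update approach used in the SQL version.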
The first thing was to create a table of first names that I could use to build the comma-delimited, random-length firstnames strings that update the main table. I created the first table, "firstnames", using this script.
*All examples were done with MySQL Workbench pulling data from AWS; PHP code is run from my HostGator-hosted website. I like to do it this way because it keeps the PHP files and the data on different cloud servers. It reduces bandwidth charges marginally, and it adds an extra layer of security by requiring authentication between both sites via frequently changed Elastic IPs on AWS and a corresponding updated entry in the HostGator cPanel "Remote Access" interface.
The firstnames Table
use db;

drop table if exists firstnames;

create table firstnames (
  firstname varchar(25),
  id2 int(11)
);

alter table firstnames
  add column id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  add primary key (id);

alter table firstnames
  add column list varchar(255),
  add index (list);
This may look a bit unorthodox at first. I use this method to generate most of my initial tables. The typing can get redundant when working on a large database project, so I keep a library of scripts I can call up quickly that require only a few changes to generate a new table. It also provides more consistency, which facilitates writing stored procedures. Of course, a great programmer never writes the same line twice! (yeah… right).
The table is probably self-explanatory, but I will briefly explain two fields: 1) id2 and 2) list. I created the id2 field in case I needed a field for rank or for storing a number similar to the rgt field used in pre-order tree traversal with nested sets; regardless, it’s always prudent to have an extra integer field. The list field will hold the comma-delimited string of first names.
The rstring() Stored Procedure
DELIMITER $$
CREATE DEFINER=`root`@`%` PROCEDURE `rstring`()
BEGIN
  SET @A1 := FLOOR(1 + RAND() * 500);
  SET @A2 := 1;
  SET @var = NULL;
  SET @s = CONCAT("SELECT GROUP_CONCAT(firstname SEPARATOR ',') INTO @var FROM firstnames WHERE id = ", @A1, " GROUP BY id");
  PREPARE STMT1 FROM @s;
  EXECUTE STMT1;
  SELECT @var;
END$$
DELIMITER ;
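To make the procedure's moving parts easier to follow, here is a hedged Python simulation of what rstring() does: pick a random id, then GROUP_CONCAT the firstname values of every row sharing that id. The sample rows and the small id range are hypothetical; the real procedure draws an id from 1 to 500:

```python
import random
from collections import defaultdict

random.seed(3)

# Hypothetical firstnames rows as (id, firstname) pairs; several names can share an id.
rows = [(1, "alice"), (1, "bob"), (2, "carol"), (2, "dave"), (2, "eve")]

def rstring(rows, max_id=2):
    """Mimic rstring(): pick a random id, then GROUP_CONCAT the matching firstnames."""
    a1 = random.randint(1, max_id)          # SQL: SET @A1 := FLOOR(1 + RAND() * 500)
    grouped = defaultdict(list)
    for id_, name in rows:
        grouped[id_].append(name)
    return ",".join(grouped[a1])            # SQL: GROUP_CONCAT(firstname SEPARATOR ',')

print(rstring(rows))
```

Note that GROUP_CONCAT only produces a multi-name string when more than one row shares the chosen id; with one firstname per id, the result is a single name.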