June 30, 2010
Database Design 101b, #SQLU
For today’s class at SQL University, (#SQLU on Twitter, the brain child of Jorge Segarra, who blogs at The SQL Update Statement, and can also be found championing the Social side of DBAness as @SQLChicken on Twitter), we’ll be delving a little bit deeper into gathering requirements, we’ll determine what our database will look like based on those requirements, discuss normalization of data and touch on a couple of different ways databases can be built.
Thanks for my involvement (again)
I know it sounds like I’m sucking up (and maybe I am, I’m not telling!) but I would like to once more express my heartfelt thanks to Chancellor Segarra for inviting me to lecture in this esteemed forum. It can’t be said enough what a great opportunity this is, and I am greatly appreciative.
Why are they important?
There are many ways to build any database, but they all start with one question – what data will they hold? This is determined by the requirements of the business. In a lot of cases a business will live or die on its data accuracy and completeness, which should illustrate its importance. I know I am sounding like a parrot, repeating the words “business requirement” over and over, so let’s take this time to define what I mean.
What are they?
Business requirements simply means this: What data does the business need in its day to day operations? Taking some examples from my last lecture, if we’re designing for a financial institution, then our database will be holding customer names and other identifying information, financial transactions and the like, including information dealing with the institution itself, and how all the customer transactions affect the overall business. If we are designing for some sort of retail outlet, we again will probably have customer identifying information, along with their orders and payment methods, and we’ll probably also have some part of the database devoted to keeping track of the business’ stock and supplies.
In the initial stages of design, the database designer should sit down with the business’ management, and ask what sorts of data they expect to deal with. Some parts of it, as I mention above, will be obvious and quick to go over, some not so much. In the retail scenario, you may run across details such as the warehouse location system, the different methods of shipping, etc. Asking for further and deeper detail will make the design process go easier, as it is better to have all of the entities (defined as pieces of data, such as customer names and addresses, or warehouse locations) to be recorded defined at the outset.
What managers/customers want vs. what they need
I used this heading in my last lecture, but it bears repeating. Managers and customers have an idea what they want, but they don’t know what they need! A retail manager will tell you that the business needs to know a customer’s address, this is obvious. What may not be so obvious is that customers can, and do, have more than one address. In addition to possibly separate billing and shipping addresses, customers can have more than one address to ship to; they may be shipping gifts to other people, etc. This means that we need to take into account multiple addresses for each customer. This is only one example, but it points out that only a very thorough and detailed conversation about the business’ use of data will yield what their data needs are, and the data professional building the database needs to be savvy enough to understand the underlying data needs.
Requirements Gathered, Now What?
The Entity-Relationship Model
That was a mouthful, what is that?!?!? Entities are groups, or clusters, of information, such as the group of information that defines customer identifying information, names, addresses, etc. As we’ll discuss in the normalization section, how different entities relate to one another is the relationship part of that tongue twister, for example, the orders that a customer makes – orders and customers are related to each other so the business can tie them together. Finally, the Model part of the heading refers to how the database is initially conceived – as a model, not necessarily a true database.
Database modeling is the act of defining the entities, filling those entities with attributes; that is pieces of data that help define the whole, such as first name, last name of a customer, then finally relating those entities to each other. It is at this point in the process that database design tools come into the picture. You’ll remember from the first lecture that I stated that these tools can run the gamut from a napkin and pencil to very expensive, complicated software that can model the data then actually build the database for you!
The Chicken and the Egg
I got stuck at this point in writing this lecture, unsure whether to dive into Modeling tools, or Normalization first. The two sort of go hand in hand. The tools build the relationships; the relationships are determined by the level of normalization. It was when I finally wrote that last sentence that I realized we need to discuss normalization first, tools later. So…
Normalization can’t be stressed enough
What is this Normalization?
Normalization is the result of normalizing data. Clear as mud, right? By normalizing data, we remove repetition from within the data. Let’s work through an example. If I need an entity that describes a customer, what should that entity contain? At minimum, it should have at least the following columns, or fields, of data: First Name, Last Name, Address, City, State, and Zip Code. I’m in the US, so for purposes of discussion, I will ignore other country’s systems of address postal boxes or residences. Further, there may be more detail required, but these fields will suffice for discussion.
Now, an entity, or table, with the above fields, looks on the surface to be perfectly valid, but let’s add some data to it:
|John||Smith||123 Any St||Big City||MA||02134|
|Jane||Doe||456 Main St||Small City||MA||01234|
|William||Smith||2 Mulberry Ln||Small City||MA||01234|
This is, obviously, a lame data set, but it’s enough to illustrate the point. Look at the repeat data within the data set. People with the same last name, people in the same city. While it doesn’t look like much here, how much repetition will there be with thousands of customers? And, what if there are orders, in the same data? Look at this abysmal example of an entity:
|John||Smith||123 Any St||Big City||MA||02134||001012||345||Disc|
|Jane||Doe||456 Main St||Small City||MA||01234||001453||651||Bowl|
|William||Smith||2 Mulberry Ln||Small City||MA||01234||001232||345||Disc|
Again, lame, just follow me. What do we do if there are more orders than one for a given customer? Add more columns? How do we go about finding those extra orders later? And can we use this type of system to keep up with inventory? I hope you see the answer to these questions.
The answer is normalization, and the way to do it is through what are known as keys. Keys are like codes that define different parts of the data, allowing us to match data in one entity to data in another by way of relationships. The quickest way to explain is by illustration, and I will illustrate the concept by showing you the result of normalizing this set of data into the following related entities:
|0001||John||Smith||123 Any St|
|0002||Jane||Doe||456 Main St|
|0003||William||Smith||2 Mulberry Ln|
It looks like more data than we started with, and it is, for this small data set. It also can be normalized further, as I will discuss in a moment. But you can already see the advantage of this system as the data set grows and the data gets further normalized. There will be fewer repeated rows of information, as illustrated in the Cities entity, and, once the orders entity is normalized to separate out part numbers and descriptions into their own entities, even less.
The Normal Forms
I will just touch on this subject, and allow the student to follow up – in fact, one of my fellow professors will be in this week with a more detailed article on normalization, in more fact, I am honored to be partnered with this particular professor this week – I will leave his identity a surprise. For now, though, realize that there is what is known as different Normal Forms, called First Normal Form (1NF), Second Normal Form (2NF), etc., right on up to Sixth Normal Form (6NF).
E. F. Codd, the inventor of the relational database, came up with the set of rules to define how data should be organized within a database, these are the Normal Forms. Essentially, each hierarchy of Forms reduces repetition further. The Normalization illustrated above is really only 1NF. Breaking out the cities into their own table with their own keys starts into 2NF. As you go up through the Forms, at the top you are left with only two columns per entity, the relational key and the data, such as a City# (key) and City (data). While it looks strange at first glance, the more Normal your database, the easier it is to query.
Data Modeling Tools
Now that we have some understanding of Normalization, what tools are available to us to build a Normalized Database?
The Napkin and Pencil
I love the fact that this set of tools is valid for building databases – it emphasizes that databases are, at their base, intuitive things. What we’ve done above is really a napkin and pencil database design, even though it was built in a word processor. It was just a quick jotting of what data is needed, and then a breaking down, or normalization, of that data. The same can be done with pencil and paper. A bit more time consuming than some other ways, but still valid.
The Big Guns – Professional Data Modeling Software
There are quite a few commercial and free data modeling software packages available to the designer. These can take the input of data needs and visually design the database on screen, ending in actually creating the Tables, Primary and Foreign keys and other objects in a ready to use database. Prices range from nothing all the way up to prices so high websites won’t divulge them – you need to contact the company’s “Sales Engineer” to find out how much. I say, if you need to ask how much, you probably can’t afford it!
An Option, and a Caution
Within SQL Server is a Database Diagram tool, within which you can lay out tables and keys, and relate them to each other. This is an option for the intrepid designer, which Microsoft obviously thought, as they include it in the package. There is one caution you must remember if you use it, though – it builds the database as you diagram it, while you diagram it! This means that SQL Server is changing and rearranging your database as you make changes to it, and if this is a production system you are doing it on, your overall server performance will suffer. Use it if you will, but be warned of the consequences.
Wrap it up
What we’ll cover next time
For our next class, we will discuss creating the physical database (so far we have only really talked about building a model of the database), we’ll talk about keys a bit more in depth, and their less thought of brothers, indexes, and we’ll talk about some methods of getting information from data, to illustrate why the design is important, something I said I would do today, but realized I needed to cover a few more subjects before we got there..
Check this out for homework
For homework, I am going to go out on a limb and suggest someone *much* smarter than I to reference on the subject of Database Design. I am giving here a 100 level class; the gentleman to whom I refer can take you as far as you want to go. I speak here of the Great Buck Woody, who is a SQL Server MVP and is currently a SQL Server Technical Specialist at Microsoft, and point you to his starting article entitled Database Design: Requirements, Entities, and Attributes. There are links at the top and bottom of the article to go Next and Back, I would suggest you use them.
Until next class, then, enjoy yourselves.