-

My Latest Work

Posted by InfinitiesLoop on Jun 17, 2008 in ASP.Net, Dotnet  | View Original Article
 Sky Lynn Reed's 0th birthday is 5/15/2008 at 5:44pm, when she weighed 8 pounds, 0 ounces. A nice round binary number. Our first born! There are simply no words to describe the experience. We debated on whether we should have kids for a long time. After...(read more)

Tags: ,

 
-

The Golden Rule of Data Manipulation

Posted by Jeff Smith on Jun 11, 2008 in Database, SqlServer  | View Original Article
 Introduction There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved.  In case you haven't noticed, that's a common theme I discuss here on this blog -- different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way.  This is yet another example of that situation. The "Golden Rule of Data Manipulation" is a simple, but important rule that you should always follow when designing a database,  writing database code, or really writing any application code at all for that matter:
"It is always easier and more flexible to combine data elements rather than to break them apart"
In other words: Concatenation is easy. Parsing is hard.  Often, very hard -- or even impossible depending on the data. Problems with Parsing It is amazing how often I see people struggling with "difficult SQL problems" such as:
  • Working with CSV lists of values in a single column, such as "1,3,56,2"
  • Breaking out a FirstName/MiddleName/LastName/Suffix from a single "Name" column
  • Parsing address strings into City/State/ZIP, or Number/Street/Unit
  • Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
  • Figuring out how to calculate the Day,Month, and/or Year from different string values such as "23-Jan-08", "2008-02", "20070303", "03032007"
And on and on it goes.... Now, sometimes you inherit or import data that needs to be parsed -- that's a fact of life.  You've got to figure out how to do it, and the key in those cases is to accept that because the data itself is essentially random, nothing you can write will perfectly work 100% of the time on all of it.  Often, the best you can do is handle most of the data, and then do some manual clean up. Parsing strings can be a very difficult task for any programmer, and the challenge isn't writing the code, it's coming up with the algorithm (another common theme on this blog).  Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems: Oscar De La Hoya How would your algorithm parse that one?  Never mind prefixes such as "Dr." and suffixes such as "Jr."! Please don't interpret what I am saying as a programming challenge -- I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases.  The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random.  It's just like the old saying: "garbage in, garbage out".  Still one of my favorites, after all these years, and it still applies! A Data Model that requires Parsing = A Poor Data Model So, we need to accept that sometimes you've got to parse data like this.  And that's OK; it happens, it can be done, even if some manual work is often involved. However, there's no excuse when you design your database, your SQL code, or applications so that free-form data must be parsed, when you can simply design it correctly in the first place and store your data already broken out into the smallest possible units with the correct data types. If breaking out a contact's name into First, Last, Middle, etc is important to your application, then you should force the point of data entry to accept input broken out into those columns.  The same goes for phone numbers, addresses, and so on.  Any time you have the option of accepting  input as clean, short, raw, segments of data you should always do it.   Once you have data at that smaller resolution, it is trivial to combine it any way that you want for presentation, formatting, filtering, and so on. It may seem like overkill to break out a phone number into 4 columns:
AreaCode Exchange Number Extension
And, in fact, it might be more complicated than that if you need to deal with international phone numbers.  You may look at your tables, and your code, and even the UI that accepts these fields and think "that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!" But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:
  1. Sort these numbers any way you want, without worrying about extra characters like parenthesis or dashes, or leading 1s, messing things up
  2. Filter quickly on an area code without the need to use LIKE, and again worrying about extra characters getting in the way
  3. Present the phone number quickly and easily any way you want without any parsing, be it as 123.123.1345 x123 or "(123) 123-1345 extension 123", or anything you want.
  4. Validate your phone numbers, ensuring you have all the necessary parts and they are the proper length, without worrying about parsing strings
Considering doing any of those things if your data is stored in random strings like:
1-123-124-1234 (123) 124-1234 123-124-1234, ex. 123 123.124.1234 x123 (123)124.1234 ext. 123 1 123 124 1234 123
and so on ...  Not so easy in that case, just as parsing simple "Name" columns into First/Last, or addresses into  Number/Street/Unit is not so easy as well. Again, this is not a programming challenge -- I am sure it can done. (In fact, phone numbers are generally the easiest because you can usually just ignore anything other than digits.)  Most of us have done it before.  But designing something in such a way that parsing is required to do simple filtering, sorting, or formatting, is a bad design. It's Not About the UI As I wrote here, you should never think "I want to display phone numbers like 123.123.1234, so I should store them and return them that way."  You should always think "How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?" So, what if you need fine detail when storing addresses, but you don't want your UI to present Street Number, Street Name, Unit Type, Unit Number as different data entry fields for usability or aesthetic reasons?  That's fine, but that doesn't mean you should not set up your database properly.  Your UI can certainly still present that one single "Address" text box for the user to fill out, parse that text at data entry, show the user the parsed result in multiple fields, and ask "Please verify for your address" or something along those lines.   Then, if not, the user can tweak the results and save it.  If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient. All of this applies not only to data storage, but to how data is returned and passed between tiers as well.  Again, if you just return separate columns to your client application, instead of focusing on making them "look nice" in your database code by returning nothing but long, "pre-formatted" strings, your client can simply concatenate and format those columns any way it needs.  And, different clients can format that same database output in different ways -- all without ever altering any database code! Conclusion In short, remember that writing concatenation is easy, efficient, and exact.  Writing a parsing routine, on the other hand, is often none of those things.   You may not always be able to control the design of the data you are working with, but be sure that when you can, you do it right.  If you find yourself using lots of LIKE expressions, or string parsing for simple data retrieval operations, something is wrong.   Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone. Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always applies.  Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn't quite as hard as you thought it was.

Tags:

 
0

365 Tips for Green IT

Posted by Administrator on Jun 8, 2008 in Green Computing, Virtualization  | View Original Article
 

COMING SOON!  If you are interested in Green IT, this is a not-to-miss eBook.




eBook Tip Categories

  • Developing a Green IT Strategy
  • Governance
  • Lifecycle Management
  • Energy
  • Paper
  • Electronic Waste
  • Water

Register to receive our Newsletter (right) and we'll let you know when the eBook is out!

Tags: ,

 
-

The Truth about “Cursor Busting” in SQL

Posted by Jeff Smith on Jun 5, 2008 in Database, SqlServer  | View Original Article
 Let's say you are called in to troubleshoot a stored procedure that is performing poorly. You dive in to investigate and this is what you find:
create procedure ProcessProducts
as
declare @Products cursor, @ProductID int
set @Products = cursor for select ProductID from Products order by ProductID
open @Products

fetch next from @Products into @ProductID

while (@@FETCH_STATUS=0)
begin
exec DoSomething @ProductID fetch next from @Products into @ProductID
end

deallocate @Products
Ah ha! A cursor!  It seems we have identified the bottleneck: Clearly, the performance problems are because the code is not doing things in a set-based manner, but rather by processing rows one at a time using a dreaded cursor.  This cursor is opening up the Products table, looping through the rows one at a time, and calling the "DoSomething" stored procedure for each ProductID.  As we all know, cursors are not the way to go when writing SQL code; this cursor should eliminated and replaced with a cleaner, more efficient (and more socially acceptable!) solution. So, how we do optimize this?  Well, a commonly suggested approach is to eliminate the CURSOR by replacing it with a WHILE loop: declare @ProductID int
set @ProductID = -99999

while (@ProductID is not null)
begin
set @ProductID = (select top 1 ProductID from Products where ProductID > @ProductID order by ProductID asc)
exec DoSomething @ProductID
end Instead of declaring a CURSOR to loop through the table, we now are using "set-based" code and our problems seem to be solved.  The cursor is gone, our code looks much cleaner, we've tested it and it works properly, so off to production it goes.  Another cursor has been busted! Right? Actually ... no. You see, eliminating cursors is not about syntax.  It is not about searching for the word "cursor" in your code and just replacing it with a WHILE loop that does the same thing.  Optimizing and replacing cursors involves much more.  We can never optimize any cursor code until we look deeper into what exactly is happening when we "process" each of those rows.  In this case, we need to find out what that "DoSomething" procedure is actually doing. Suppose the DoSomething procedure is generating a report and sending an email to the "Product Manager" for each product that contains status information, and then logging this email message into a table somewhere. If that is the case, what have we just gained by replacing our CURSOR? Honestly -- not much,  if anything at all.  Because of the task at hand, we may very well need to process rows in the Product table one-by-one to send our emails and generate the report, and the bottleneck here is not the cursor code at all, but rather the report generation and maybe sending the email.   Eliminating the cursor code probably gains us nothing here.  If you need to process rows one at a time, go ahead and use a cursor -- that's what they are there for!   Replacing a perfectly fine, simple cursor with a WHILE loop might even make your code longer, or more confusing, or even less efficient depending on circumstances. For example, what if we need to process the Products ordered by Region, then Product Name, for whatever reason.  Our cursor code is simple:
set @Products = cursor for
select ProductID
from Products
order by Region, ProductName
All that we needed  to change was our ORDER BY clause.  Now, how would we write this as a WHILE loop?  Is it possible?  Sure.  Will it be as simple and clean as using a cursor?  No, it won't. (Though ROW_COUNT() makes this much easier than it used to be) Now, I am not here to say that cursors are "good", but if you really need to process rows one by one, go ahead and proudly use a cursor.   Replacing cursors isn't about processing rows one-by-one in a different way (i.e., using a WHILE loop instead), it is about not processing rows one-by-one at all! Let's consider another scenario: What if the DoSomething stored procedure is checking to see if the Product's ExpireDate is greater than today's date, and if so, it is updates the Status column for that Product to 'X'. In that situation, what have we gained by rewriting ProcessProducts without a cursor, and using a WHILE loop instead?   The answer is, once again: nothing! In fact, we potentially have once again made our code more confusing or even less efficient than a cursor might be!  Remember, the bottleneck isn't the cursor syntax -- it is the fact that we are processing rows one at a time.  Replacing the cursor with the WHILE loop didn't solve this problem, did it? So, looking now at both of the scenarios I presented for the DoSomething stored procedure, it should be clear that we did not fix anything by replacing the cursor in either case simply by writing a WHILE loop.  If that's all you are doing, don't bother replacing the cursor at all. You haven't optimized anything. As I said before, the art of replacing a cursor is not a find-and-replace syntax change operation -- it is a fundamental change in how you process your data.  As in the Product report generation and email example, it may be that we simply need to process rows one by one, and thus no further optimization is possible from a SQL point of view.  In situations like updating the Product table, however, we do not need to process the rows individually -- we can do everything in one single UPDATE statement.  Thus, in order to determine how to optimize the ProcessProducts stored procedure, we needed to dig deeper into entire process as a whole, which included examining the DoSomething stored procedure and determining the full scope of exactly what this "ProcessProducts" stored procedure is doing. So, if "DoSomething" is updating the Products table as specified, we now know that a good replacement for our cursor code doesn't result in a WHILE loop and calling a separate stored procedure over and over at all -- it results a true, set-based solution:
create procedure ProcessProducts
as
Update Products set Status='X' where ExpireDate > getdate()

And THAT is how you optimize a cursor! No loops, no calling of another stored procedure for each row in a table, no "find-and-replace" cursor code removal.  We examined the entire process, and rewrote the entire process, to get it done quicker and shorter and faster without cursors or loops. Always remember: Replacing a cursor isn't about rewriting your syntax, it is about redesigning your algorithm.

Tags: ,

Copyright © 2010 Answer My Query All rights reserved. Maintained by Orange Brains .