Choose A Databast Management System

Computing : Software

Anyone looking for a good flame war can always drop into any software development forum and casually ask what database software should be used for the next project, or even ask whether he or she needs to bother with a relational database. Because databases technologies are such an important part of programmer philosophy, it is hard to find objective discussion for the hapless developer looking for a good, general-purpose database management system (DBMS) to use in a project. How does one choose between the relative features of different DBMSes? What sorts strengths and weaknesses of various products can one anticipate products? This article tries to discuss some of these matters on even ground.
The study of databases is a battleground of ideas. The database community is one of the oldest in the computer world, and it is almost as famous as the application programming community for the diversity of its ideas and the sharpness of the debates between its gurus. Lately events have conspired to expose these concerns to a wider audience. For instance, the seemingly inexhaustible march of the Web revolution has exposed more and more developers to database issues because of the desire for ever more dynamic Web sites. And the crown prince of Web technologies, XML, has had the effect of increasing awareness of data design in general.

This means that more and more developers find themselves choosing between database management systems (DBMSes). This can be a daunting choice considering the many available DBMSes, both open and closed source, and the broad spectrum of differences between them. This article provides some guidance through the maze of available DBMS features and methodologies, to help the developer quickly narrow the choices to the best candidate.

Note that no DBMSes are mentioned by name. There are many resource that provide the feature sets of particular offerings. This article focuses on the general technologies, and hopefully also gains the advantage of minimizing bias towards any particular package.

Models

Probably the most fundamental choice to make in the DBMS hierarchy is the model used to store, manage, and query databases. Besides affecting what software you need to acquire, this affects the very way you will think about the data, and can be a surprisingly hard choice to undo later on. Note that as I keep the discussion to database models that are in significant current use; this is not meant to be a comprehensive survey of DBMS methodologies through history since some significant models, such as network databases, have been omitted because of a lack of modern tools and practice.

Ad-hoc databases

The earliest databases were merely ordered aggregations of raw data, which one could call an ad-hoc database. They are not stored so as to optimize storage or queries (a query is a request for database records that conform to a given pattern), but are usually designed to be read in as a whole by the application (although there have been "random access" ad hoc database systems). Ad hoc databases are still used quite often. From address books that are merely sequences of the address information in files, to the bulk e-mail lists most notoriously used by spammers, these are most useful when you have fully anticipated the use of the represented data and you are certain that more efficient or query-friendly formats won't ever be necessary. In this case, you needn't read much further. Your DBMS is encoded in the standard library of your favorite language: fwrite for C users, BufferedFileStream for Java users, etc. Ad-hoc databases are very efficient and convenient if all the data is of low volume, is typically accessed together by an application, and a single application at that (that is, there is not much current or future need for sharing the data between applications). However, they become very unwieldy, inefficient and unmanageable once they grow beyond the size of available memory, as access patterns change, and if they need to be shared between applications.

Hash-based databases

Early algorithm specialists were quick to attack the problem of inefficient queries by coming up with systems for creating hashes of data records, which are compact keys that uniquely identify the record. Hashes are easy to manage and with a key, one can rapidly retrieve the entire record. Hash-based DBMSes, which use these techniques, are quite popular because of their simplicity and the fact that they come for free with most UNIX systems. They are very fast, and almost every programming language provides APIs for their access, but they tend to be quite bare on features. They are very well-suited to situations where an application wants to pluck records one at a time from the database, using a well-defined key. An example is for user profiles and authentication; where the application looks up a record by user ID, does its thing with the data, and moves on. They are less well suited to situations where records need to be cross-referenced, or the information in the records needs to be sliced and diced in some clever way.

Hierarchical databases

An early development in DBMS was to organize information into regular records containing other regular records in a more structured way. These are known as hierarchical databases and have enjoyed a bit of a revival with XML's popularity, because XML has a general hierarchical structure. Hierarchical databases can be quite suitable for data such as purchase histories that consist of tightly coupled records of information, for example, customer information to purchases made to support calls placed. The problem with hierarchical databases is that they have a way of accumulating redundant data (which was one of the main claims of relational databases in their battle to wrest dominance from hierarchical databases in the '70s). Another problem is that they can be hard to query flexibly in ways that go against the tight coupling of the data hierarchies.

Relational databases

Relational databases are, of course, the current king of the hill in database technologies. This doesn't mean that more data is kept in relational databases than any other model, but rather that when one goes about asking about what the "real" database model of choice is, he or she is most likely to be told to get relational religion. There is some good reason for this. Relational databases are wonderful for discouraging redundant data and for the speed of complex queries; they also have a huge number of tools and APIs to support them. They are best used in situations where a lot of records are being combined and cross-referenced to synthesize results. An example might be the production data of a manufacturing firm, where information about inventory, part specifications, personnel availability, costs, sales and supplies need to be thoroughly analyzed in order to make production decisions. However, like any power tool, they can be quite dangerous. Relational DBMSes (RDBMS) are designed to model very highly structured data which has been modeled with mathematical precision. If one's database design is not up to snuff, not only might the advantages of the relational model be lost, but the result can actually be worse for maintainability than with less stringent models. If you do opt for relational databases, be sure you understand concepts such as normalization and referential integrity. These days, almost every RDBMS uses the Structured Query Language (SQL) for description and querying of the records.

Object databases

Object databases emerged as a way to translate the techniques of object-oriented programming to data storage models. The data are organized as distinct objects, each of which belongs to a class, which might use inheritance to acquire aspects of other classes. Each object can have a set of attributes of simple types such as integer and string, and relationships to other objects. As you can imagine, they provide a very natural API for access using object-oriented languages such as C++, Java and Python. Object databases can be a great choice for this reason, but it can also seduce programmers into poor data design: techniques that make sense when the data lives in memory can be very slow and resource-intensive when the data is stored on disk.

Semi-structured databases

The emergence of XML has enlivened another corner of database modeling: semi-structured databases. As RDBMS took over the universe, many developers lamented that their rigorousness made them unsuitable for modeling data designed directly for human consumption, that is, more loosely organized records including structured documents and systems that made it easy to make changes in the model. Efforts to provide DBMSes that accommodated such "semi-structured" data thrived in academia until XML took them to the mainstream. Most XML formats define semi-structured data, and so XML -- whether directly stored in files or in an XML repository -- provide a great deal of flexibility, especially in Web-based systems where the documents are as important as the structured records. The main drawback is lack of efficiency. The data typically take up much more of the available resources than with other database models, and queries can be slow and cumbersome to set up. Semi-structured databases are very strong where documents and more structured data coexist, such as Intranets and Web-based applications.

Back to top

Languages

Since most databases a vital organ for a complete application, the interface between the database and the application development language is quite important. The DBMS of choice should have a natural and efficient API in your programming language of choice, and preferably more than one, since competition improves quality across the board. Because developer effort is usually more expensive than run-time resources, it is probably most important that you choose a DBMS that supports the APIs and languages with which you are comfortable. It's often worth even going to a database model that is less suitable to the data in question if it is more suitable to the skills of the available developers.

There are two general approaches to language integration of DBMS access. One involves embedded access, where the language actually provides first-class structures for the DBMS. You use your usual language constructs and add inline database commands without any ado. Usually, you can directly use values from your regular programming in the database commands, and vice versa. Examples include ANSI SQL-C, which allows one to write SQL directly into a C program, using C variables as parameters in queries, and query results as C values. Many scripting languages also provide embedded database access, and languages such as Perl and Python that support hash-based data structures usually provide a transparent method for making these structures persistent through hash-based DBMSes. Finally, most object-oriented databases strive for transparent persistence of arbitrary data objects with a virtual flick of a switch.

The other approach is through function calls to a database engine. This is the most common approach and includes SQL call-level interface (CLI) in C, Java Database Connectivity (JDBC) and Perl/Python Database Interface (DBI). The disadvantage is that one needs to do special housekeeping to get values from the programming language to the database commands and vice versa. The advantage is that one usually has more flexibility in choosing the database access engine because the database commands are logically separated from the rest of the program.

Devices

Of course the DBMS of choice must work on the platform used by the rest of the application (or must at least be accessible from this platform), but there might be other platform needs as well. Be sure to consider who might end up using your system, and choose a DBMS that would run on other important platforms in future. This is not always directly obvious; for instance, do you think your database might grow until it is too much for your current hardware to handle? If so, does your DBMS run on platforms that support clustering? Does it have special features to take advantage of clustering?

And don't forget the brave new world of wireless computing. Perhaps the users in your near future will want to access your database on the go, from their palm computer or cell phone. There are DBMSes with a very low "footprint" which are suitable for execution on such devices and there are also DBMSes that provide low-footprint access programs to allow connected wireless devices to grab the data right from the central store. These might be features to consider for your endeavor.

Features

Probably the most important general features to consider in your DBMS hunt are security-related. Consider how thoroughly the DBMS requires authentication from users and keeps an audit trail of the accesses. But security goes beyond keeping out malicious users. Be sure your backup supports backup and restore, not just by archiving your raw database files, but also the ability to integrate into incremental backup regimens. It might be enough to have options to dump to or restore from structured text (which can be incrementally backed up using tools such as diff), and of course direct integration into the backup software for the system as a whole is even better. Structured text dump and restore are also a boon for interchange with other systems. Examples include comma-delimited formats and dumped sequences of SQL commands.

Consider whether the DBMS supports access by multiple users at once (multi-user support), which is an important feature in many situations. If so, beware of common problems where database state becomes inconsistent because modifications are interrupted by error conditions, or such modifications affect the state of other programs accessing the database. For instance, what if your program reads a field twice from the database, and another program modifies the data between these reads. Your program might not be prepared for the resulting disparity. Solutions to such problems are called transaction and concurrency (locking) control and are important features if your database requires a high volume of access from multiple simultaneous users.

Standards support is very important if you want skills and code to be portable to different ventures than the current database development. If using RDBMS, be sure it features broad support for SQL. If you are going to be bound by the rigor of the relational model, you should at least be able to take advantage of the whole extent of its features. If you are using an object-oriented DBMS, support for the Object Database Management Group's (ODMG's) standards provides better hope of porting your code to other DBMS products. Regardless of the chosen model, language or platform, investigate what open standards there are for DBMS and look for these in the products under consideration.

And of course you should be sure the DBMS is usable. Does it have friendly tools for direct manipulation by the administrator? Does it have good documentation online, with options to get paper documentation if required? If it is a commercial product is the reputation for technical support a sound one? Newsgroups and user mailing lists are good places to get clues to this. If it is not a commercial product, you will probably be relying on newsgroups and mailing lists themselves for service. How active are the developers on these fora? Look at the mailing list archives and Google news for clues on this.

Conclusion

The choice of DBMS can be almost religious (as can be the choice of programming language or platform). Be very careful with the arguments you may hear for a particular product and keep your mind open. Part of the reason there is such a variety of DBMSes is that it is not a one-size-fits-all choice by any means. Every nuance of the project you are undertaking can affect DBMS choice. Also don't be afraid to make a prototype of the project in question before diving fully in with your chosen project. There is less risk in your choice if you can quickly learn from a bite-sized chunk whether or not your initial inclinations were right. You might want to use a free database for such a prototype, or you might want to ask the DBMS vendor of interest for a trial copy. Most vendors make evaluation copies available (although you might just get weekly calls from the regional sales rep after you start your evaluation

By: Yatesh
Share on Facebook: On Twitter: Twitter Tweet this!
Reply to Choose A Databast Management System
Receive Our Newsletter

Questions about database:

Ask question

More Articles:
How to Clean Your Prefetch to Improve Performance
How to rename the My Computer icon
Website Topics