Preface: I wanted to talk about database performance tips to help with the overall design of your system and structure to assist with the best performance possible on very large databases whose record count could potentially get to millions and millions of records. Several factors need to be taken into account
in the initial database design to prevent unbearable lag in performance later.
The best way to describe what to do is to describe what not to do or to provide performance replacements for typical database use that could potentially cause lag in large databases.
While some of these steps in general could be applied to even MS SQL SYSTEMS and Databases. The intent of this article is to provide for an understanding of what to do in a LAMP environment. (Linux, Apach, MySQL & PHP).
I am going to speak in general terms here to save time. If anyone needs some specific information or examples. Just comment and I will post a future article regarding that.
1. Some common performance mistakes.
A. JOIN statement. This statement is a very useful statement to bring 2 tables together into one set of query results. With large table record counts though. The action of joining 2 tables, even when indexed properly, can be a real performance drain. Even retrieval of 1000 records from 2 tables could take a couple minutes to process depending on the complexity of your SQL Query and the content and record count of the 2 tables. A better way to retrieve the same information with less performance cost is to use GROUP BY. Using GROUP BYcan potentially return the same information desired in an incredible fraction of the time it might take to JOIN two tables.
B. Multiple Field indexes. Single field indexes are great for retrieving records when your WHERE condition is seeking a single field. For small databases, there is not a noticable performace dip to have several single field indexes and use a WHERE condition on multiple fields with or even without single field indexes. However, when retrieving records from tables with 50,000+ records, using single field indexes can cause a huge performance issue in retrieving your data the more records you have. To solve this problem, rather than, or in addition to single field indexes, create an index acroos multiple fields. Try to limit the number of fields specifically to the fields you might be using in a WHERE condition to retrieve records. My results in using multiple field indexes, to retriece data using multiple conditions is that the data seems to come back to me at a very quick pace, perhaps half the speed of a sequential record retrieval with no conditions.
2. Database structure.
A. Tables.
From the beginning of Relational Databases, people where now able to tie together many tables when needed avoiding the large overhead and repetition of data in fields and records with a totally flat database structure. I think that with this though we can get a bit carried away with the number of tables and relations to the point where it seriously degrades performance. Until Christopher Columbus exclaimed that the world was round instead of flat, people got along just fine with flat thinking. I am not saying to make flat databases, but rather use causion on the number of relational points as to not hinder performace.
B. Data Organization
Lets say you are storing some historical data, if you are like me, you started your SQL database to join together many company information feeds into one place to do away with many older systems that did not use SQL. You might prepare a plan of your data to be stored into it’s exact structure broken down into a myriad of numberic, data, character, and other types of fields. Consider this though. The more data types you have and the different types you have, the more processing your system will have to do when retrieving data. I would suggest preparing a storage plan where you identify fields you will be searching by and then combining the rest of the field data into single fields that have many of your historical data fields in a fixed column format inside a single text field. By using this plan, your system can do less work retrieving your data and perform faster. You would want to group these historical fields into catagories so that you don’t have to retrieve all your historical data every time you need a piece of it. Just cut down on the total number of fields this way and you can see some of the benefits.
Of course there are many other data organizing techniques that can provide you with increased performance, however the details of those depend greatly on the data being stored, and what information you plan to retrieve. So I will entertain some specific questions if you have them in a different post.
C. Too many tables – System concern.
This topic relates to file fragmentation. With SQL Databases, file fragmentation is going to occur. Most SQL Databases are live 24/7 so you can’t really take the system down to defragment your files. It’s been stated that using InnoDB rather than MyISAMs helps with the fragmentation issue. You can visit Peter Zaitsev’s site by searching google “MySQL File System Fragmentation Performance Benchmarks” for more information on this. Also you can buy the book “High Performance MySQL” by Peter and many other authors.
3. Millions and Millions of records.
Asside from fragmentation concerns which can seriously hamper MySQL performance, the aspect of retrieving and crunching Millions of records, especially in PHP, would be a time consuming and daunting task. In some cases it just cannot be done with convential means, the problem is just to big to handle with normal record retrieval. Fortunately MySQL 5.0 finally implemented stored proceedures and Functions. Now you can retrieve data to the stored proceedure or function at the server level, crunch the data, and just retrieve back the cunched information you need. This is very helpful for gathering statistics, controlling quotas, managing departmental information flow, etc. CAUTION. Use care and planning when implementing stored proceedures. It’s possible to bog down the server with too much to do, and max out the memory in the server and bring everything else to a crawl. For safety I recommend replicating your origonal database to a read only copy on another MySQL server and then using that server to do the stored procedure crunching to prevent a few things. Namely prevent lag on non-crunching related operations such as casual record retrieval, update, and insert operations.
Thats’s all for now.
More later.