PostgreSQL Performance Tips
取自:
http://chery.axonpro.sk/pgsql/chapters/performance/index.html
PostgreSQL Performance Tips
POSTGRESQL is an object-relational database under active development on the Internet. You can learn more by visiting http://www.postgresql.org.
In an ideal world, users would never need to be concerned about performance. The system would tune itself. Unfortunately, we do not live in an ideal world. An untuned database can be thousands of times slower than a tuned one, so it pays to take steps to improve performance. This article shows you how to get the optimal performance from your database.
Indexes
When accessing a table, POSTGRESQL normally reads from the beginning of the table to the end, looking for relevant rows. With an index, it can quickly find specific values in the index, then go directly to matching rows. In this way, indexes allow fast retrieval of specific rows from a table.
For example, consider the query SELECT * FROM customer WHERE col = 43. Without an index, POSTGRESQL must scan the entire table looking for rows where col equals 43. With an index on col, POSTGRESQL can go directly to rows where col equals 43, bypassing all other rows.
For a large table, it can take minutes to check every row. Using an index, finding a specific row takes fractions of a second.
Internally, POSTGRESQL stores data in operating system files. Each table has its own file, and data rows are stored one after another in the file. An index is a separate file that is sorted by one or more columns. It contains pointers into the table file, allowing rapid access to specific values in the table.
POSTGRESQL does not create indexes automatically. Instead, users should create them for columns frequently used in WHERE clauses.
To create an index, use the CREATE INDEX command, as shown in this
figure.
test=> CREATE INDEX customer_custid_idx ON customer (customer_id);
CREATE
In this example, customer_custid_idx is the name of the index, customer is the table being indexed, and customer_id is the column being indexed. Although you can use any name for the index, it is good practice to use the table and column names as part of the index name--for example, customer_customer_id_idx or i_customer_custid. This index is useful only for finding rows in customer for specific customer_ids. It cannot help when you are accessing other columns, because indexes are sorted by a specific column.
You can create as many indexes as you wish. Of course, an index on a seldom-used column is a waste of disk space. Also, performance can suffer if too many indexes exist, because row changes require an update to each index.
It is possible to create an index spanning multiple columns. Multicolumn indexes are sorted by the first indexed column. When the first column contains several equal values, sorting continues using the second indexed column. Multicolumn indexes are useful only on columns with many duplicate values.
The command CREATE INDEX customer_age_gender_idx ON customer (age, gender) creates an index that is sorted by age and, when several age rows have the same value, then sorted on gender. This index can be used by the query SELECT * FROM customer WHERE age = 36 AND gender = 'F' and the query SELECT * FROM customer WHERE age = 36.
The index customer_age_gender_idx is useless if you wish to find rows based only on gender, however. The gender component of the index can be used only after the age value has been specified. Thus, the query SELECT * FROM customer WHERE gender = 'F' cannot use the index because it does not place a restriction on age, which is the first part of the index.
Indexes can be useful for columns involved in joins, too. They can even be employed to speed up some ORDER BY clauses.
To remove an index, use the DROP INDEX command. See the CREATE_INDEX and DROP_INDEX manual pages for more information.
Unique Indexes
Unique indexes resemble ordinary indexes, except that they prevent duplicate
values from occurring in the table. This figure shows the creation of one table
and a unique index.
test=> CREATE TABLE duptest (channel INTEGER);
CREATE
test=> CREATE UNIQUE INDEX duptest_channel_idx ON duptest (channel);
CREATE
test=> INSERT INTO duptest VALUES (1);
INSERT 130220 1
test=> INSERT INTO duptest VALUES (1);
ERROR: Cannot insert a duplicate key into unique index duptest_channel_idx
The index is unique because of the keyword UNIQUE. The remaining queries try to insert a duplicate value, but the unique index prevents this and displays an appropriate error message.
Sometimes unique indexes are created only to prevent duplicate values, not for performance reasons. Multicolumn unique indexes ensure that the combination of indexed columns remains unique. Unique indexes do allow multiple NULL values, however. Unique indexes both speed data access and prevent duplicates.
CLUSTER
The CLUSTER command reorders the table file to match the ordering of an index. This specialized command is valuable when performance is critical and the indexed column has many duplicate values.
For example, suppose the column customer.age has many duplicate values, and the query SELECT * FROM customer WHERE age = 98 is executed. An index on age allows rapid retrieval of the row locations from the index. If thousands of matching rows exist, however, they may be scattered in the table file, requiring many disk accesses to retrieve them. CLUSTER reorders the table, placing duplicate values next to each other. This speeds access for large queries accessing many duplicate values.
CLUSTER even helps with range queries like col >= 3 AND col <= 5. The command places these rows next to each other on disk, speeding indexed lookups.
In addition, CLUSTER can also speed ORDER BY processing. See the CLUSTER manual page for more information.
VACUUM
When POSTGRESQL updates a row, it keeps the original copy of the row in the table file and writes a new one. The original row, marked as expired, is used by other transactions still viewing the database in its prior state. Deletions are similarly marked as expired, but not removed from the table file.
The VACUUM command removes expired rows from the file. In the process, it moves rows from the end of the table into the expired spots, thereby compacting the table file.
You should run VACUUM periodically to clean out expired rows. For tables that are heavily modified, it is useful to run VACUUM every night in an automated manner. For tables with few modifications, VACUUM should be run less frequently. The command exclusively locks the table while processing.
You can run VACUUM in two ways. Using VACUUM alone vacuums all tables in the database. Using VACUUM tablename vacuums a single table.
VACUUM ANALYZE
The VACUUM ANALYZE command resembles VACUUM, but also collects statistics about each column's proportion of duplicate values and the maximum and minimum values. POSTGRESQL uses this information when deciding how to efficiently execute complex queries. You should run VACUUM ANALYZE when a table is initially loaded and when a table's data changes dramatically.
The VACUUM manual page shows all of the VACUUM options.
EXPLAIN
EXPLAIN causes POSTGRESQL to display how a query will be executed,
rather than executing it. As an example, this figure shows a SELECT
query preceeded by the word EXPLAIN.
test=> EXPLAIN SELECT customer_id FROM customer;
NOTICE: QUERY PLAN:
Seq Scan on customer (cost=0.00..15.00 rows=1000 width=4)
EXPLAIN
In the figure, POSTGRESQL reports a sequential scan will be used on customer, meaning it will read the entire table. The cost is an estimate of the work required to execute the query (the numbers are only meaningful for comparison). The rows indicates the number of result rows expected. The width is the number of bytes per row.
The next figure shows more interesting examples of EXPLAIN.
test=> EXPLAIN SELECT customer_id FROM customer WHERE customer_id = 55;
NOTICE: QUERY PLAN:
Seq Scan on customer (cost=0.00..22.50 rows=10 width=4)
EXPLAIN
test=> VACUUM ANALYZE customer;
VACUUM
test=> EXPLAIN SELECT customer_id FROM customer WHERE customer_id = 55;
NOTICE: QUERY PLAN:
Seq Scan on customer (cost=0.00..17.50 rows=1 width=4)
EXPLAIN
test=> CREATE UNIQUE INDEX customer_custid_idx ON customer (customer_id);
CREATE
test=> EXPLAIN SELECT customer_id FROM customer WHERE customer_id = 55;
NOTICE: QUERY PLAN:
Index Scan using customer_custid_idx on customer (cost=0.00..2.01 rows=1 width=4)
EXPLAIN
test=> EXPLAIN SELECT customer_id FROM customer;
NOTICE: QUERY PLAN:
Seq Scan on customer (cost=0.00..15.00 rows=1000 width=4)
EXPLAIN
test=> EXPLAIN SELECT * FROM customer ORDER BY customer_id;
NOTICE: QUERY PLAN:
Index Scan using customer_custid_idx on customer (cost=0.00..42.00 rows=1000 width=4)
EXPLAIN
The first EXPLAIN shows a SELECT with the restriction customer_id = 55. The command reports another sequential scan, but the restriction causes POSTGRESQL to estimate that ten rows will be returned. A VACUUM ANALYZE command is then run, causing the next query to properly estimate that one row will be returned instead of ten. An index is created, and the query rerun. This time, an index scan is used, allowing POSTGRESQL to go directly to the rows where customer_id equals 55. The next EXPLAIN shows a query with no WHERE restriction. POSTGRESQL realizes that the index is useless and performs a sequential scan. The last query has an ORDER BY that matches an index, so POSTGRESQL uses an index scan.
Even more complex queries can be studied using EXPLAIN, as shown in
this figure.
test=> EXPLAIN SELECT * FROM tab1, tab2 WHERE col1 = col2;
NOTICE: QUERY PLAN:
Merge Join (cost=139.66..164.66 rows=10000 width=8)
-> Sort (cost=69.83..69.83 rows=1000 width=4)
-> Seq Scan on tab2 (cost=0.00..20.00 rows=1000 width=4)
-> Sort (cost=69.83..69.83 rows=1000 width=4)
-> Seq Scan on tab1 (cost=0.00..20.00 rows=1000 width=4)
EXPLAIN
In this example, tab1 and tab2 are joined on col1 and col2. Each table is sequentially scanned, and the result sorted. The two results are then merge joined to produce output. It also supports hash join and nested loop join methods. It chooses the join method it believes to be the fastest.
Summary
A variety of tools are available to speed up POSTGRESQL queries. Although their use is not required, they can produce huge improvements in query speed. Another article outlines more steps that database administrators can take to improve performance.
About the Author
Bruce Momjian is Vice-President of Database Development at Great Bridge, LLC, http://www.greatbridge.com and author of PostgreSQL: Introduction and Concepts, ©2001, Addison-Wesley, ISBN 0-201-70331-9. This article is taken from that book. The web site for the book is http://www.postgresql.org/docs/awbook.html.