Month: March 2010

Oracle Plan Stability (Stored Outlines) in PeopleSoft Global Payroll

A longer version of this posting, with experimental results, is available on my website.

In PeopleSoft for the Oracle DBA, I wrote a page (p. 291) explaining why stored outlines were not suitable for use in PeopleSoft.  Five years later, my view has not significantly changed.  Essentially, stored outlines work best with shared SQL, and there isn’t much shared SQL in PeopleSoft, because a lot of it is dynamically generated.

  • SQL generated by the component processor is inherently dynamic.  At save time, only the fields that have changed are updated.
  • PeopleCode can be written in such a way that WHERE clauses are dynamically assembled.
  • nVision reports have variable numbers of criteria on literal tree node IDs in the queries.
  • By default in Application Engine, bind variables are converted to literals before the SQL is submitted to the database.  Even if this is overridden by enabling ReUseStatement in Application Engine or by using Cursor Sharing in the database, the code still wouldn’t be shareable.  Different instances of Application Engine executing the same program can use different instances of non-shared temporary records, so the tables in otherwise identical SQL statements are not the same.  You would get one version of the statement per temporary table instance, as illustrated below.
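
To illustrate that last point with a hypothetical example (these table names are invented, not taken from a delivered program), two concurrent instances of the same program, allocated different instances of the same temporary record, would submit textually different statements that can share neither a cursor nor an outline:

UPDATE PS_GP_WRK_TMP1 SET PROCESSED = 'Y' WHERE EMPLID = 'K0G001';
UPDATE PS_GP_WRK_TMP2 SET PROCESSED = 'Y' WHERE EMPLID = 'K0G001';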

However, there are very limited exceptions to this rule (otherwise I wouldn’t have a story to tell).  The SQL in COBOL and SQR programs is more likely to be shareable.  Although some programs are coded to generate SQL dynamically, bind variables are passed through to the SQL statements, and these programs use regular tables, not PeopleSoft temporary records, for working storage.
A Global Payroll customer came to me with a problem where the payroll calculation (GPPDPRUN) would usually run well, but sometimes the execution plan of a statement would change and the calculation would take several additional hours.  It is significant that the Global Payroll engine is written in COBOL.  My usual response to this sort of problem in Global Payroll is to add a hint to the stored statement.  Usually, I find that only a few statements are affected.  However, after this happened a couple of times in production, it was clear that we couldn’t continue to react to these problems; we needed to proactively stop them happening again.  This is exactly what stored outlines are designed to do.

Using Stored Outlines in the PeopleSoft GP Engine

Earlier I said that we could apply stored outlines to the Global Payroll calculation engine (GPPDPRUN) because it generally doesn’t use dynamic code with embedded literal values.  
While outlines are being created, the following privilege needs to be granted.  It can be revoked later.

GRANT CREATE ANY OUTLINE TO SYSADM;

We can create a trigger to collect the stored outlines for a payroll calculation, thus:

  • The trigger fires when a payroll calculation process starts or finishes. 
  • At the start of a payroll process, it starts collecting stored outlines in a category with the same name as the process: GPPDPRUN.
  • When the process finishes, outline collection is disabled by setting the parameter back to FALSE.
CREATE OR REPLACE TRIGGER sysadm.gfc_create_stored_outlines
BEFORE UPDATE OF runstatus ON sysadm.psprcsrqst
FOR EACH ROW
WHEN (new.prcsname = 'GPPDPRUN' AND (new.runstatus = 7 OR old.runstatus = 7))
DECLARE
  l_sql VARCHAR2(100);
BEGIN
  l_sql := 'ALTER SESSION SET create_stored_outlines = ';
  IF :new.runstatus = 7 THEN
    EXECUTE IMMEDIATE l_sql||:new.prcsname;
  ELSIF :old.runstatus = 7 THEN
    EXECUTE IMMEDIATE l_sql||'FALSE';
  END IF;
EXCEPTION
  WHEN OTHERS THEN NULL; -- because I don't want to crash the process scheduler
END;
/
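
In effect, when the run status of GPPDPRUN changes to 7 (processing), the trigger causes the session making that status update (the payroll process’s own database session) to execute:

ALTER SESSION SET create_stored_outlines = GPPDPRUN;

and when the process leaves that status:

ALTER SESSION SET create_stored_outlines = FALSE;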

The exact number of outlines collected will vary with configuration and with which employees are processed, because different payroll rules are invoked.
If no more outlines are to be collected, the CREATE ANY OUTLINE privilege can be revoked.  This does not prevent the outlines from being used.

REVOKE CREATE ANY OUTLINE FROM SYSADM;

Then, the outlines in that category can be used in subsequent executions by replacing the trigger above with the one below; so long as the SQL doesn’t change, the execution plans cannot change.

CREATE OR REPLACE TRIGGER sysadm.gfc_use_stored_outlines
BEFORE UPDATE OF runstatus ON sysadm.psprcsrqst
FOR EACH ROW
WHEN (new.prcsname = 'GPPDPRUN' AND (new.runstatus = 7 OR old.runstatus = 7))
DECLARE
  l_sql VARCHAR2(100);
BEGIN
  l_sql := 'ALTER SESSION SET use_stored_outlines = ';
  IF :new.runstatus = 7 THEN
    EXECUTE IMMEDIATE l_sql||:new.prcsname;
  ELSIF :old.runstatus = 7 THEN
    EXECUTE IMMEDIATE l_sql||'FALSE';
  END IF;
EXCEPTION
  WHEN OTHERS THEN NULL; -- because I don't want to crash the process scheduler
END;
/

After running an identify-and-calc and a cancel, we can see how many of the outlines are actually used.

SELECT category, count(*) outlines
, sum(decode(used,'USED',1,0)) used
FROM user_outlines
GROUP BY category
ORDER BY 1
/

CATEGORY                         OUTLINES       USED
------------------------------ ---------- ----------
GPPDPRUN                              572        281

I have a large number of unused outlines because of the additional recursive SQL generated when OPTIMIZER_DYNAMIC_SAMPLING is set to 4.  This does not occur at the default setting of 2.

I can then remove the unused outlines.

EXECUTE dbms_outln.drop_unused;

Used flags on the outlines can be reset, so that later we can see the outlines being used again.

BEGIN
  FOR i IN (SELECT * FROM user_outlines WHERE category = 'GPPDPRUN') LOOP
    dbms_outln.clear_used(i.name);
  END LOOP;
END;
/

If I want to go back to running without outlines, I just disable the trigger:

ALTER TRIGGER sysadm.gfc_use_stored_outlines DISABLE;

To re-enable outlines, just re-enable the trigger.

ALTER TRIGGER sysadm.gfc_use_stored_outlines ENABLE;

Conclusions

Stored Outlines have very limited application in a PeopleSoft system.  However, they can easily be collected and used with the PeopleSoft Global Payroll engine.  It is just a matter of granting a privilege and using the database triggers on the process request table.
Testing that they actually have the desired effect is quite difficult, because you are trying to prove a negative.  I don’t think it is adequate simply to say that the outline has been used.

  • First you would need an environment where payroll calculation performs well, where you could collect outlines.
  • Then you would need a payroll calculation that performs poorly because the execution plan for at least one SQL statement is different:
    • either on a second environment with exactly the same code,
    • or in the same environment on a different set of data.
  • Then, it would be possible to demonstrate that applying the outline causes the execution plan to revert and restores the performance.  This can be confirmed by comparing the ASH data for the various scenarios (see the sketch below).
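
The following query is a sketch of the kind of ASH comparison I mean (ASH requires the Diagnostics Pack licence, and I am assuming here that MODULE has been set to the process name, which may itself require additional instrumentation such as a logon trigger).  It profiles DB time by statement and execution plan, so a statement whose plan has flipped shows up as one SQL_ID with two different plan hash values and very different timings across the two runs:

SELECT   h.sql_id, h.sql_plan_hash_value
,        COUNT(*) ash_secs  -- each ASH sample approximates 1 second of DB time
FROM     v$active_session_history h
WHERE    h.module = 'GPPDPRUN'
GROUP BY h.sql_id, h.sql_plan_hash_value
ORDER BY ash_secs DESC
/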

Even if you don’t want to use a stored outline immediately, it might be advantageous to collect them, and have them available when you do encounter a problem.

Distinguishing Two Major Types of Column-Stores

I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency (e.g. here, here, and here), due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately. I believe that calling both groups of systems column-stores has led to a great deal of confusion and misplaced expectations. This blog post attempts to clear up some of this confusion, highlighting the high-level differences between these groups of systems. At the end, I will propose some potential ways to rename these groups to avoid confusion in the future.

For this blog post, I will refer to the following two groups as Group A and Group B:
  • Group A: Bigtable, HBase, Hypertable, and Cassandra. These four systems are not intended to be a complete list of systems in Group A — these are simply the four systems I understand the best in this category and feel the most confident discussing.
  • Group B: Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright. Again, this is not a complete list, but these are the systems from this group I know best. (Row/column hybrid systems such as Oracle or Greenplum are ignored from this discussion to avoid confusion, but the column-store aspects of these systems are closer to Group B than Group A.)
Differences between Group A and Group B
  1. Data Model. Group A uses a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map). Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. Group B uses a traditional relational data model. This distinction has caused great confusion. People more familiar with Group A are very much aware that Group A does not use a relational data model, and assume that since Group B systems are also called column-stores, they also must not use a relational data model. This has resulted in many intelligent people saying “column-stores are not relational”, which is completely incorrect.
  2. Independence of Columns. Group A stores parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query. However, column-families may consist of many columns, and these columns within column-families are not independently accessible.

    Group B stores columns from a traditional relational database table separately so that they can be accessed independently. Like Group A, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that every column is stored separately, instead of families of columns as in Group A (this statement ignores fine-grained hybrid options within Group B).

  3. Interface. Group A is distinguished by being part of the NoSQL movement and does not typically have a traditional SQL interface. Group B supports standard SQL interfaces.
  4. Optimized workload. Group B is optimized for read-mostly analytical workloads. These systems support reasonably fast load times, but high update rates tend to be problematic. Hence, data warehouses are an ideal market for Group B, since they are typically bulk-loaded, require many complex read queries, and are updated rarely. In contrast, Group A can handle a more diverse set of application requirements (Cassandra, in particular, can handle a much higher rate of updates). Group B systems tend to struggle on workloads that “get” or “put” individual rows in the data set, but thrive on big aggregations and summarizations that require scanning many rows as part of a single query. In contrast, Group A generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. Group A systems can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places. Another reason for the difference is the storage layer implementation, explained below.
  5. Storage layer. Assume the following customer table:

     ID  First Name  Last Name  Phone     Email
     1   Joe         Smith      555-1234  jsmith@gmail.com
     2   Jack        Williams   555-5668  jwilliams@gmail.com
     3   Jill        Davis      555-5432  NULL
     4   James       Miller     NULL      jmiller@yahoo.com
     5   Jamie       Wilson     555-6527  NULL
     6   Justin      Taylor     555-8247  jtaylor@aol.com

    Although there is some variation within the systems in Group B, to the first order of approximation, this group will store the table in the following way:

    (ID) 1, 2, 3, 4, 5, 6

    (First Name) Joe, Jack, Jill, James, Jamie, Justin

    (Last Name) Smith, Williams, Davis, Miller, Wilson, Taylor

    (Phone) 555-1234, 555-5668, 555-5432, NULL, 555-6527, 555-8247

    (Email) jsmith@gmail.com, jwilliams@gmail.com, NULL, jmiller@yahoo.com, NULL, jtaylor@aol.com

    Note that each value is stored by itself, without information about what row or column it came from. We can figure out what column it came from since all values from the same column are stored consecutively. We can figure out what row it came from by counting how many values came before it in the same column. The fourth value in the id column matches up to the same row as the fourth value in the last name column and the fourth value in the phone column, etc. Note that this means that columns that are undefined for a particular row must be explicitly stored as NULL in the column list; otherwise we can no longer match up values based on their position in their corresponding lists.

    Meanwhile, systems in Group A will explicitly store the row-name, the column-name, or both with each value. E.g.: row2, lastname: Williams; row5, phone: 555-6527, etc. The reason is that Group A uses a sparse data model (different rows can have a very different set of columns defined). Storing NULLs for every undefined column could soon lead to the majority of the database being filled with NULLs. Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single-column column-family. (Group A will also typically store a timestamp per value, but explaining this would only complicate this discussion.)

    This results in Group B typically taking much less space on storage than Group A (at least for structured data that easily fits into a relational model). Furthermore, by storing just column-values without column-names or row-names, Group B optimizes performance for column-operations where each element within a column is read and an operation (like a predicate evaluation or an aggregation) is applied. Hence, the data model combined with the storage layer implementation results in wildly different target applications for Group A and Group B.
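
To make the positional matching in the storage-layer discussion above concrete, here is a toy PL/SQL sketch (purely illustrative; it is not how any of the Group B systems are actually implemented). Each column is held as a separate list, and the fourth row is rebuilt by reading position four of each list:

DECLARE
  TYPE t_list IS TABLE OF VARCHAR2(30);
  l_id    t_list := t_list('1','2','3','4','5','6');
  l_last  t_list := t_list('Smith','Williams','Davis','Miller','Wilson','Taylor');
  l_phone t_list := t_list('555-1234','555-5668','555-5432',NULL,'555-6527','555-8247');
BEGIN
  -- The NULL must occupy position 4 of l_phone, or the lists fall out of step.
  -- (Run with SERVEROUTPUT ON to see the reconstructed row: "4 Miller NULL".)
  dbms_output.put_line(l_id(4)||' '||l_last(4)||' '||NVL(l_phone(4),'NULL'));
END;
/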

Renaming the Groups

Clearly along each of these five dimensions, Group A and Group B are extremely different. Consequently, even though calling them both column-stores has some advantages (it makes it seem like the “column-store movement” is large and a really hot area), I believe that lumping Group A and Group B together does more damage than good, and that we need to make a greater effort to avoid confusing these two groups in the future. Here are some suggestions for names for Group A and Group B towards this goal:

  • Group A: “column-family store”
    Group B: “column-store”

    (The problem here is that Group B doesn’t get a new name, which means that “column-store” could either refer to Group B or both Group A/B)

  • Group A: “non-relational column-store”
    Group B: “relational column-store”

  • Group A: “sparse column-store”
    Group B: “dense column-store”

Of these, the relational/non-relational distinction is probably the most important, and would be my vote for the new names. If you have a different idea for names, or want to vote on one of the above schemes, please leave a comment below.


Addendum:

Mike Stonebraker e-mailed me his vote which I reproduce here with his permission:

Group A are really row stores. I.e. they store a column family in a row-by-row fashion. Effectively, they are using materialized views for column families and storing materialized views row-by-row. Systems in Group B have a sophisticated column-oriented optimizer — no such thing exists for Group A.

He then went on to call Group A “MV row-stores” and sub-categorize Group B depending on how materialized views are stored, but his overarching point was that referring to anything in Group A as a “column-store” is really misleading.


Addendum 2:

This post seems to have attracted some high quality comments. I recommend reading the comment thread.

Capturing DDL for Database Objects Not Managed by PeopleTools

I have written before about the challenges of managing database objects, and attributes of database objects, that are not managed by PeopleTools Application Designer. I proposed a DDL trigger to prevent such objects being dropped or altered. However, it is sometimes necessary to disable that trigger temporarily, for example while applying a patch or upgrade release.
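
As a minimal sketch of that preventive approach (the trigger name here is illustrative; the earlier posting has the full version), a schema-level DDL trigger can simply raise an error when a protected object type is dropped:

CREATE OR REPLACE TRIGGER sysadm.gfc_prevent_drop
BEFORE DROP ON sysadm.SCHEMA
BEGIN
  IF ora_dict_obj_type = 'TRIGGER' THEN
    RAISE_APPLICATION_ERROR(-20000,
      ora_dict_obj_type||' '||ora_dict_obj_name||' may not be dropped');
  END IF;
END;
/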

Now, I have another DDL trigger and a packaged procedure that captures the DDL to recreate objects that are recursively dropped (such as DML triggers on tables). The DDL is stored in a database table. This trigger can remain permanently enabled, and the table it maintains can be used to see what objects are missing, as well as holding the SQL to rebuild them.

A common example of where this can be valuable is where a system uses database triggers to capture audit data.  This method is often preferred because it generally performs better than having the application server generate additional DML to the audit table, and it also captures updates made by other processes.  PeopleSoft even deliver processes to generate the DML triggers that write to the audit tables.  However, if you alter the table in Application Designer, perhaps only because you are applying a PeopleSoft fix, and apply the changes to the database by recreating the table, then the trigger will be lost.  It is then up to the customer to make sure the audit trigger is replaced.  There is absolutely nothing to warn you that the trigger has been lost; the application will still function without it, but your updates will not be audited.

When a table is dropped, the trigger calls a procedure in the package that checks for:

  • indexes that are not managed by PeopleTools (such as function-based indexes),
  • triggers not managed by PeopleTools (other than the PSU triggers created for mobile agents), and
  • materialised view logs.

If the table is partitioned or is a global temporary table, the DDL for the object being dropped is also captured.
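
The following is an illustrative sketch of the table-drop case (the trigger and holding table names are invented, and the real package does rather more than this): DBMS_METADATA can be used to save the DDL of any triggers on the table before the drop takes effect.

CREATE OR REPLACE TRIGGER sysadm.gfc_capture_ddl
BEFORE DROP ON sysadm.SCHEMA
BEGIN
  IF ora_dict_obj_type = 'TABLE' THEN
    FOR t IN (SELECT trigger_name FROM user_triggers
              WHERE  table_name = ora_dict_obj_name) LOOP
      INSERT INTO sysadm.gfc_ddl_capture  -- hypothetical holding table
             (object_type, object_name, table_name, ddl_text, captured_dttm)
      VALUES ('TRIGGER', t.trigger_name, ora_dict_obj_name,
              dbms_metadata.get_ddl('TRIGGER', t.trigger_name), SYSDATE);
    END LOOP;
  END IF;
END;
/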

When an index is dropped, the same index check is performed.  Similarly, the DDL to rebuild partitioned indexes, or indexes on global temporary tables, is also captured.

When an object for which the DDL has been captured is explicitly dropped, this is indicated on the table GFC_REL_OBJ by storing the time at which it was dropped.  When the object is recreated, this timestamp is cleared.  Thus it is possible to determine whether something was dropped deliberately or accidentally.
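
For example, a query along these lines (the column names are assumptions on my part, since the table definition is not reproduced in this posting) would list captured objects that are currently missing:

SELECT object_type, object_name, dropped_dttm
FROM   sysadm.gfc_rel_obj
WHERE  dropped_dttm IS NOT NULL
ORDER BY dropped_dttm
/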

DECODE/CASE vs. Mapping Tables

I was helping a colleague recently with some SQL. He had about a dozen SQL queries that all looked something like this:

SELECT (some stuff)
,      DECODE(status, 'A', 1, 'I', 0)
,      (more stuff)
FROM   (wherever)

I made a few observations about his DECODE statement: 1…
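
As a sketch of the alternative the title alludes to (the table and column names here are hypothetical, not taken from the original posting), the DECODE could be replaced by a join to a mapping table:

-- Hypothetical mapping table holding the status-to-flag translation
CREATE TABLE status_map
(status      VARCHAR2(1) NOT NULL PRIMARY KEY
,active_flag NUMBER      NOT NULL);

INSERT INTO status_map VALUES ('A', 1);
INSERT INTO status_map VALUES ('I', 0);

-- The translation is then picked up with a join instead of a DECODE
SELECT (some stuff)
,      m.active_flag
,      (more stuff)
FROM   (wherever) t
,      status_map m
WHERE  m.status = t.status;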
