Nested Data Mining Columns
One of the more advanced features of the data mining column structure is the ability to nest data mining columns. Data mining models use this nested column structure for both input and output data, as the syntax used to populate a data mining model with training data allows nested columns to be represented as subqueries. Data mining cases may not be easily described by using typical relational tables; a single case may depend on several groups of supporting information to perform predictive analysis. To illustrate this point, consider the case of a telephone company customer: A customer may have multiple telephone lines and multiple ISP accounts.
To retrieve all of the customer information, all of the telephone lines for each customer, and all of the ISP accounts for each customer, several approaches could be used:
- Employ three queries, iterating through the customer query and issuing the same two queries for retrieving telephone lines and ISP accounts over and over again for each row in the customer query.
- Employ two queries, joining the Customers table with the Telephone Lines table for one query and joining the Customers table with the ISP Accounts table for the second query.
- Employ one query with a UNION to join the two subqueries used in the previous approach, constructing empty columns in each subquery to represent the return values of the other subquery.
All three of these approaches are ungainly, involve repetitive data and action, and are highly inefficient.
However, if a single column could hold a group of columns, you could construct a single query that would return one row per customer in the Customers table containing all of the columns in the Customers table, an additional column containing all of the Telephone Lines rows for a given customer, and an additional column containing all of the ISP Accounts rows for a given customer, as shown in the following diagram.
As the diagram shows, there is no redundant data for the customer in the returned rowset; one row per customer is all that is needed, and the nested columns of the rowset contain the data pertinent to that customer. Rowsets constructed in this fashion, referred to as hierarchical rowsets, are fully supported by OLE DB.
Case information for a data mining model may not reside in a single case table, but may have supporting tables supplying additional information to define the case. In the diagram, the Telephone Lines and ISP Accounts tables serve as supporting tables for the Customers case table. They provide additional information about the case, such as the number and type of ISP accounts the customer may possess, or the number of telephone lines used by the customer. The data mining model can take advantage of nested data mining columns to process this supporting information and create additional rules and patterns for the customer based on the data in the supporting tables.