Saltar al contenido principal
waffle.svg
Domo

Optimizing an SQL DataFlow

Version 1

 

Important: When referencing this page outside of Knowledge Base, use this link: http://knowledge.domo.com?cid=optimizingdataflow

Intro

You can optimize your DataFlows by creating indices in the DataFlows and limiting your data. Doing so will help eliminate "Timed Out" errors by allowing your DataFlows to run more efficiently.

MySQL

Whenever you do a join in a DataFlow, you should index the columns you are joining on.

For example:

Select a.`column1`, a.`column2`, b.`column3`, b.`column4` FROM input_dataset_1 a

LEFT JOIN input_dataset_2 b

ON a.`column1` = b.`column3`

You should index the two columns in the ON statement. You can do this by adding a new transform BEFORE the transform where the join is happening.

To create this transform,

  1. Uncheck the Generate Output Table box in the lower left. 
  2. Write your query.

The query is divided into two sections. The first part is:

ALTER TABLE [table_name]

This defines the table you are making the index on or "altering."

The second part is:

ADD INDEX([column name]), ADD INDEX([column name]), ADD INDEX([column name])...

This defines each of the columns you are indexing. You can also index all columns together instead of one at a time by using ADD INDEX([column list comma separated]), but the query generally runs better if you index columns one by one. 

You will need to create an indexing transform for all tables being used in the join. In the previous example, you would have a transform indexing input_dataset_1 as well as input_dataset_2.   

Once finished, your DataFlow should look similar to the following screenshot:

 

Important: In the join transform, do not use statements with * such as SELECT * or SELECT a.*, as this makes the indexing invalid.   

MySQL and Redshift

Indexing in MySQL does not always optimize the DataFlow to the full extent needed. Here are additional things you can do to optimize your DataFlow:

  • Filter the columns being brought into the DataFlow. Some larger DataSets have a large number of columns, not all of which are needed. You can click on the Dataset in the Input Datasets section to limit what columns are being brought into the DataFlow.

  • Filter your data. The most common way to do this is by using a data filter. For example:

    SELECT `date`, `value`, `series`
    FROM input_dataset_1
    WHERE YEAR(`date`) = YEAR(CURRENT_DATE())

     

  • Take advantage of the GROUP BY function. When filtering data, you sometimes end up with duplicate rows because they were broken out by another column in the original data. If you aggregate value columns then apply a GROUP BY to the remaining columns, you can condense the number of rows. The most common aggregation is SUM. Here is an example:

    SELECT `Date`, `Series`, SUM(`value`) AS ‘value’
    FROM input_dataset_1
    GROUP BY `Date`, `Series`

     

  • If you have a transform with multiple JOINs, you can break them up into multiple transforms. For example:

    SELECT a.`column`,a.`column4`, b.`column2`,b.`column5`,c.`column6`, c.`column3`
    FROM input_dataset_1
    LEFT JOIN input_dataset_2 b
    ON a.`column4` = b.`column2`
    LEFT JOIN input_dataset_3 c
    ON b.`column5` = c.`column3`

  • You can also split a transform into two transforms. For example:

    Transform 1

    SELECT a.`column`,a.`column4`, b.`column2`,b.`column5`
    FROM input_dataset_1
    LEFT JOIN input_dataset_2 b
    ON a.`column4` = b.`column2`


    Transform 2

    SELECT a.`column`,a.`column4`, a.`column2`,a.`column5`,c.`column6`, c.`column3`
    FROM transform_data_1 a
    LEFT JOIN input_dataset_3 c
    ON a.`column5` = c.`column3`

Make sure that if you are using MySQL, you index the columns from transform_data_1 before doing the join, as well as any additional indexing you didn't create. 

If none of these methods helps optimize your data, please reach out to Domo Support.