Ingredients¶
Ingredients are the building block of recipe.
Ingredients can contain columns that are part of the SELECT
portion of a query,
filters that are part of a WHERE
clause of a query, group_bys that
contribute to a query’s GROUP BY
and havings which add HAVING
limits
ot a query.
Creating ingredients in python¶
Ingredients can be created either in python or via configuration. To created Ingredients in python, use one of the four convenience classes.
Metric: Create an aggregated calculation using a column. This value appears only in the SELECT part of the SQL statement.
Dimension: Create a non-aggregated value using a column. This value appears in the SELECT and GROUP BY parts of the SQL statement.
Filter: Create a boolean expression. This value appears in the WHERE part of the SQL statement. Filters can be created automatically using the AutomaticFilters extension or by using a Dimension or Metric’sales
build_filter
method.Having: Create a boolean expression with an aggregated ColumnElement. This value appears in the HAVING part of the SQL statement.
Metrics and Dimensions are commonly reused in working Recipe code, while filters are often created temporarily based on data.
Features of ingredients¶
Let’s explore some capabilities.
Formatters¶
Formatters are a list of python callables that take a single value. This
let you manipulate the results of an ingredient with python code. If you use
formatters, the original, unmodified value is available as {ingredient}_raw
.
shelf = Shelf({
'state': Dimension(Census.state),
'age': WtdAvgMetric(Census.age, Census.pop2000),
'gender': Dimension(Census.gender),
'population': Metric(func.sum(Census.pop2000), formatters=[
lambda value: int(round(value, -6) / 1000000)
])
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('gender').metrics('population')
for row in recipe.all():
print('{} has {} people'.format(row.gender, row.population))
print('\tThe original value is: {}'.format(row.population_raw))
The results look like
F has 144 million people
The original value is: 143534804
M has 137 million people
The original value is: 137392517
Building filters¶
Ingredient.build_filter
Storing extra attributes in meta¶
Extra keyword arguments that get passed to ingredient initialization
get stored in the meta
object. This can be used to extend the
capabilities of ingredients and add extra features.
d = Dimension(Census.age, icon='cog')
print(d.meta.icon)
>>> 'cog'
Types of Ingredients¶
List of ingredients
Dimension¶
Dimensions are groupings that exist in your data. Dimension objects add the column to the select statement and the group by of the SQL query.
# A simple dimension
self.shelf['state'] = Dimension(Census.state)
Adding an id¶
Dimensions can use separate columns for ids and values. Consider a
table of employees with an employee_id
and a full_name
. If you had
two employees with the same name you need to be able to distinguish between
them.
# Support an id and a label
self.shelf['employee']: Dimension(Employee.full_name,
id_expression=Employee.id)
The id is accessible as employee_id
in each row and their full name is
available as employee
.
If you build a filter using this dimension, it will filter against the id.
Adding an ordering¶
If you want to order a dimension in a custom way, pass a keyword argument
order_by_expression
. This code adds an order_by_expression that causes the
values to sort case insensitively.
from sqlalchemy import func
# Support an id and a label
self.shelf['employee']: Dimension(Employee.full_name,
order_by_expression=func.lower(
Employee.full_name
))
The order_by expression is accessible as employee_order_by
in each row and
the full name is available as employee
. If the employee dimension is used in a
recipe, the recipe will always be ordered by func.lower(Employee.full_name)
.
Adding additional groupings¶
Both id_expression
and order_by_expression
are special cases of Dimension’s
ability to be passed additional columns can be used for grouping. Any keyword argument
suffixed with _expression
adds additional roles to this Dimension. The first
required expression supplies the dimension’s value role. For instance,
you could create a dimension with an id
, a latitude
and a longitude
.
For instance, the following
Dimension(Hospitals.name,
latitude_expression=Hospitals.lat
longitude_expression=Hospitals.lng,
id='hospital')
would add columns named “hospital”, “hospital_latitude”, and “hospital_longitude” to the recipes results. All three of these expressions would be used as group bys.
Using lookups¶
You can use a lookup table to map values in your data to descriptive names. The _id
property of your dimension contains the original value.
# Convert M/F into Male/Female
self.shelf['gender']: Dimension(Census.sex, lookup={'M': 'Male',
'F': 'Female'}, lookup_default='Unknown')
If you use the gender dimension, there will be a gender_id
in each row
that will be “M” or “F” and a gender
in each row that will be “Male” or
“Female”.
shelf = Shelf({
'state': Dimension(Census.state),
'gender_desc': Dimension(Census.gender, lookup={'M': 'Male',
'F': 'Female'}, lookup_default='Unknown'),
'age': WtdAvgMetric(Census.age, Census.pop2000),
'population': Metric(func.sum(Census.pop2000))
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('gender_desc').metrics('population')
print(recipe.to_sql())
print(recipe.dataset.csv)
Lookups inject a formatter in the first position. Because a formatter
is used, recipe creates a gender_desc_raw
on the response that
contains the unformatted value then uses the lookup to create the gender_desc
property. All dimensions also generate an {ingredient}_id
property.
Here is the query and the results.
SELECT census.gender AS gender_desc_raw,
sum(census.pop2000) AS population
FROM census
GROUP BY census.gender
gender_desc_raw,population,gender_desc,gender_desc_id
F,143534804,Female,F
M,137392517,Male,M
Metric¶
Metrics are aggregations performed on your data. Here’s an example of a few Metrics.
shelf = Shelf({
'total_population': Metric(func.sum(Census.pop2000)),
'min_population': Metric(func.min(Census.pop2000)),
'max_population': Metric(func.max(Census.pop2000))
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.metrics('total_population', 'min_population', 'max_population')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results of this recipe are:
SELECT max(census.pop2000) AS max_population,
min(census.pop2000) AS min_population,
sum(census.pop2000) AS total_population
FROM census
max_population,min_population,total_population
294583,217,280927321
DivideMetric¶
Division in SQL introduces the possibility of division by zero. DivideMetric guards against division by zero while giving you a quick way to divide one calculation by another.
shelf = Shelf({
'state': Dimension(Census.state),
'popgrowth': DivideMetric(func.sum(Census.pop2008-Census.pop2000), func.sum(Census.pop2000)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state').metrics('popgrowth')
This creates results like:
SELECT census.state AS state,
CAST(sum(census.pop2008 - census.pop2000) AS FLOAT) /
(coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS popgrowth
FROM census
GROUP BY census.state
state,popgrowth,state_id
Alabama,0.04749469366071285,Alabama
Alaska,0.09194726152996757,Alaska
Arizona,0.2598860676785905,Arizona
Arkansas,0.06585681816651036,Arkansas
California,0.0821639328251409,California
Colorado,0.14231283526592364,Colorado
...
The denominator has a tiny value added to it to prevent division by zero.
WtdAvgMetric¶
WtdAvgMetric
generates a weighted average of a number using a weighting.
Warning
WtdAvgMetric
takes two ColumnElements as arguments. The first is the value
and the second is the weighting. Unlike other Metrics, these are not aggregated.
Here’s an example.
shelf = Shelf({
'state': Dimension(Census.state),
'avgage': WtdAvgMetric(Census.age, Census.pop2000),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state').metrics('avgage')
print(recipe.to_sql())
print(recipe.dataset.csv)
This generates results that look like this:
SELECT census.state AS state,
CAST(sum(census.age * census.pop2000) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS avgage
FROM census
GROUP BY census.state
state,avgage,state_id
Alabama,36.27787892421841,Alabama
Alaska,31.947384766048568,Alaska
Arizona,35.37065466080318,Arizona
Arkansas,36.63745110262778,Arkansas
California,34.17872597484759,California
...
Note: WtdAvgMetric uses safe division from DivideMetric
.
Filter¶
Filter objects add a condition to the where clause of your SQL query. Filter objects can be added to a Shelf.
shelf = Shelf({
'state': Dimension(Census.state),
'population': Metric(func.sum(Census.pop2000)),
'teens': Filter(Census.age.between(13,19)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('population')\
.filters('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
This results in output like:
SELECT census.state AS state,
sum(census.pop2000) AS population
FROM census
WHERE census.age BETWEEN 13 AND 19
GROUP BY census.state
state,population,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Different ways of generating Filters¶
Recipe has several ways of filtering recipes.
Filter objects can be added to the shelf. They can be added to the recipe by name from a shelf. This is best when you have a filter that you want to use in many place.
shelf = Shelf({ 'age': Dimension(Census.age), 'state': Dimension(Census.state), 'population': Metric(func.sum(Census.pop2000)), 'teens': Filter(Census.age.between(13,19)), }) ... recipe = recipe.filters('teens')
Filter objects can be created dynamically and added to the recipe. This is best if the filtering needs to change dynamically.
recipe = recipe.filters(Filter(Census.age.between(13,19))
Ingredient.build_filter can be used to build filters that refer to the ingredient’s column.
age_filter = shelf['age'].build_filter([13,19], 'between') recipe = recipe.filters(age_filter)
This is best when you want to reuse a column definition defined in an ingredient.
AutomaticFilters: The AutomaticFilters extension adds filtering syntax directly to recipe.
recipe = recipe.automatic_filters({ 'age__between': [13,19] })
This is best when you want to add many filters consistently. AutomaticFilters uses
Ingredient.build_filter
behind the scenes.
Having¶
Having objects are binary expressions with an aggregated column value.
One easy way to generate Having
objects is to build_filter
using
a Metric
.
shelf = Shelf({
'age': Dimension(Census.age),
'avgage': WtdAvgMetric(Census.age, Census.pop2000),
'state': Dimension(Census.state),
'population': Metric(func.sum(Census.pop2000)),
})
# Find states with a population greater than 15 million
big_states = shelf['population'].build_filter(15000000, operator='gt')
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('population')\
.order_by('-population')\
.filters(big_states)
print(recipe.to_sql())
print(recipe.dataset.csv)
This generates the following results.
SELECT census.state AS state,
sum(census.pop2000) AS population
FROM census
GROUP BY census.state
HAVING sum(census.pop2000) > 15000000
ORDER BY sum(census.pop2000) DESC
state,population,state_id
California,33829442,California
Texas,20830810,Texas
New York,18978668,New York
Florida,15976093,Florida