=========== Ingredients =========== Ingredients are the building block of recipe. Ingredients can contain columns that are part of the ``SELECT`` portion of a query, filters that are part of a ``WHERE`` clause of a query, group_bys that contribute to a query's ``GROUP BY`` and havings which add ``HAVING`` limits ot a query. ------------------------------ Creating ingredients in python ------------------------------ Ingredients can be created either in python or via configuration. To created Ingredients in python, use one of the four convenience classes. * **Metric**: Create an aggregated calculation using a column. This value appears only in the SELECT part of the SQL statement. * **Dimension**: Create a non-aggregated value using a column. This value appears in the SELECT and GROUP BY parts of the SQL statement. * **Filter**: Create a boolean expression. This value appears in the WHERE part of the SQL statement. Filters can be created automatically using the AutomaticFilters extension or by using a Dimension or Metric'sales ``build_filter`` method. * **Having**: Create a boolean expression with an aggregated ColumnElement. This value appears in the HAVING part of the SQL statement. Metrics and Dimensions are commonly reused in working Recipe code, while filters are often created temporarily based on data. Features of ingredients ----------------------- Let's explore some capabilities. Formatters ~~~~~~~~~~ Formatters are a list of python callables that take a single value. This let you manipulate the results of an ingredient with python code. If you use formatters, the original, unmodified value is available as ``{ingredient}_raw``. .. code:: python shelf = Shelf({ 'state': Dimension(Census.state), 'age': WtdAvgMetric(Census.age, Census.pop2000), 'gender': Dimension(Census.gender), 'population': Metric(func.sum(Census.pop2000), formatters=[ lambda value: int(round(value, -6) / 1000000) ]) }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('gender').metrics('population') for row in recipe.all(): print('{} has {} people'.format(row.gender, row.population)) print('\tThe original value is: {}'.format(row.population_raw)) The results look like .. code:: F has 144 million people The original value is: 143534804 M has 137 million people The original value is: 137392517 Building filters ~~~~~~~~~~~~~~~~ Ingredient.build_filter Storing extra attributes in meta ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Extra keyword arguments that get passed to ingredient initialization get stored in the ``meta`` object. This can be used to extend the capabilities of ingredients and add extra features. .. code:: python d = Dimension(Census.age, icon='cog') print(d.meta.icon) >>> 'cog' -------------------- Types of Ingredients -------------------- List of ingredients Dimension --------- Dimensions are groupings that exist in your data. Dimension objects add the column to the select statement and the group by of the SQL query. .. code-block:: python # A simple dimension self.shelf['state'] = Dimension(Census.state) Adding an id ~~~~~~~~~~~~ Dimensions can use separate columns for ids and values. Consider a table of employees with an ``employee_id`` and a ``full_name``. If you had two employees with the same name you need to be able to distinguish between them. .. code-block:: python # Support an id and a label self.shelf['employee']: Dimension(Employee.full_name, id_expression=Employee.id) The id is accessible as ``employee_id`` in each row and their full name is available as ``employee``. If you build a filter using this dimension, it will filter against the id. Adding an ordering ~~~~~~~~~~~~~~~~~~ If you want to order a dimension in a custom way, pass a keyword argument ``order_by_expression``. This code adds an order_by_expression that causes the values to sort case insensitively. .. code-block:: python from sqlalchemy import func # Support an id and a label self.shelf['employee']: Dimension(Employee.full_name, order_by_expression=func.lower( Employee.full_name )) The order_by expression is accessible as ``employee_order_by`` in each row and the full name is available as ``employee``. If the `employee` dimension is used in a recipe, the recipe will **always** be ordered by ``func.lower(Employee.full_name)``. Adding additional groupings ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Both ``id_expression`` and ``order_by_expression`` are special cases of Dimension's ability to be passed additional columns can be used for grouping. Any keyword argument suffixed with ``_expression`` adds additional roles to this Dimension. The first *required* expression supplies the dimension's value role. For instance, you could create a dimension with an ``id``, a ``latitude`` and a ``longitude``. For instance, the following .. code-block:: python Dimension(Hospitals.name, latitude_expression=Hospitals.lat longitude_expression=Hospitals.lng, id='hospital') would add columns named "hospital", "hospital_latitude", and "hospital_longitude" to the recipes results. All three of these expressions would be used as group bys. Using lookups ~~~~~~~~~~~~~ You can use a lookup table to map values in your data to descriptive names. The ``_id`` property of your dimension contains the original value. .. code-block:: python # Convert M/F into Male/Female self.shelf['gender']: Dimension(Census.sex, lookup={'M': 'Male', 'F': 'Female'}, lookup_default='Unknown') If you use the gender dimension, there will be a ``gender_id`` in each row that will be "M" or "F" and a ``gender`` in each row that will be "Male" or "Female". .. code-block:: python shelf = Shelf({ 'state': Dimension(Census.state), 'gender_desc': Dimension(Census.gender, lookup={'M': 'Male', 'F': 'Female'}, lookup_default='Unknown'), 'age': WtdAvgMetric(Census.age, Census.pop2000), 'population': Metric(func.sum(Census.pop2000)) }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('gender_desc').metrics('population') print(recipe.to_sql()) print(recipe.dataset.csv) Lookups inject a formatter in the first position. Because a formatter is used, recipe creates a ``gender_desc_raw`` on the response that contains the unformatted value then uses the lookup to create the ``gender_desc`` property. All dimensions also generate an ``{ingredient}_id`` property. Here is the query and the results. .. code-block:: SELECT census.gender AS gender_desc_raw, sum(census.pop2000) AS population FROM census GROUP BY census.gender gender_desc_raw,population,gender_desc,gender_desc_id F,143534804,Female,F M,137392517,Male,M Metric ------ Metrics are aggregations performed on your data. Here's an example of a few Metrics. .. code-block:: python shelf = Shelf({ 'total_population': Metric(func.sum(Census.pop2000)), 'min_population': Metric(func.min(Census.pop2000)), 'max_population': Metric(func.max(Census.pop2000)) }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .metrics('total_population', 'min_population', 'max_population') print(recipe.to_sql()) print(recipe.dataset.csv) The results of this recipe are: .. code:: SELECT max(census.pop2000) AS max_population, min(census.pop2000) AS min_population, sum(census.pop2000) AS total_population FROM census max_population,min_population,total_population 294583,217,280927321 DivideMetric ------------ Division in SQL introduces the possibility of division by zero. DivideMetric guards against division by zero while giving you a quick way to divide one calculation by another. .. code:: python shelf = Shelf({ 'state': Dimension(Census.state), 'popgrowth': DivideMetric(func.sum(Census.pop2008-Census.pop2000), func.sum(Census.pop2000)), }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('state').metrics('popgrowth') This creates results like: .. code:: SELECT census.state AS state, CAST(sum(census.pop2008 - census.pop2000) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS popgrowth FROM census GROUP BY census.state state,popgrowth,state_id Alabama,0.04749469366071285,Alabama Alaska,0.09194726152996757,Alaska Arizona,0.2598860676785905,Arizona Arkansas,0.06585681816651036,Arkansas California,0.0821639328251409,California Colorado,0.14231283526592364,Colorado ... The denominator has a tiny value added to it to prevent division by zero. WtdAvgMetric ------------ ``WtdAvgMetric`` generates a weighted average of a number using a weighting. .. warning:: ``WtdAvgMetric`` takes two ColumnElements as arguments. The first is the value and the second is the weighting. Unlike other Metrics, these are **not aggregated**. Here's an example. .. code:: python shelf = Shelf({ 'state': Dimension(Census.state), 'avgage': WtdAvgMetric(Census.age, Census.pop2000), }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('state').metrics('avgage') print(recipe.to_sql()) print(recipe.dataset.csv) This generates results that look like this: .. code:: SELECT census.state AS state, CAST(sum(census.age * census.pop2000) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS avgage FROM census GROUP BY census.state state,avgage,state_id Alabama,36.27787892421841,Alabama Alaska,31.947384766048568,Alaska Arizona,35.37065466080318,Arizona Arkansas,36.63745110262778,Arkansas California,34.17872597484759,California ... Note: WtdAvgMetric uses safe division from ``DivideMetric``. Filter ------ Filter objects add a condition to the where clause of your SQL query. Filter objects can be added to a Shelf. .. code:: python shelf = Shelf({ 'state': Dimension(Census.state), 'population': Metric(func.sum(Census.pop2000)), 'teens': Filter(Census.age.between(13,19)), }) recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('state')\ .metrics('population')\ .filters('teens') print(recipe.to_sql()) print(recipe.dataset.csv) This results in output like: .. code:: SELECT census.state AS state, sum(census.pop2000) AS population FROM census WHERE census.age BETWEEN 13 AND 19 GROUP BY census.state state,population,state_id Alabama,451765,Alabama Alaska,71655,Alaska Arizona,516270,Arizona Different ways of generating Filters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Recipe has several ways of filtering recipes. * **Filter objects can be added to the shelf**. They can be added to the recipe by name from a shelf. This is best when you have a filter that you want to use in many place. .. code:: python shelf = Shelf({ 'age': Dimension(Census.age), 'state': Dimension(Census.state), 'population': Metric(func.sum(Census.pop2000)), 'teens': Filter(Census.age.between(13,19)), }) ... recipe = recipe.filters('teens') * **Filter objects can be created dynamically** and added to the recipe. This is best if the filtering needs to change dynamically. .. code:: python recipe = recipe.filters(Filter(Census.age.between(13,19)) * **Ingredient.build_filter** can be used to build filters that refer to the ingredient's column. .. code:: python age_filter = shelf['age'].build_filter([13,19], 'between') recipe = recipe.filters(age_filter) This is best when you want to reuse a column definition defined in an ingredient. * **AutomaticFilters**: The AutomaticFilters extension adds filtering syntax directly to recipe. .. code:: python recipe = recipe.automatic_filters({ 'age__between': [13,19] }) This is best when you want to add many filters consistently. AutomaticFilters uses ``Ingredient.build_filter`` behind the scenes. Having ------ Having objects are binary expressions with an aggregated column value. One easy way to generate ``Having`` objects is to ``build_filter`` using a ``Metric``. .. code:: python shelf = Shelf({ 'age': Dimension(Census.age), 'avgage': WtdAvgMetric(Census.age, Census.pop2000), 'state': Dimension(Census.state), 'population': Metric(func.sum(Census.pop2000)), }) # Find states with a population greater than 15 million big_states = shelf['population'].build_filter(15000000, operator='gt') recipe = Recipe(shelf=shelf, session=oven.Session())\ .dimensions('state')\ .metrics('population')\ .order_by('-population')\ .filters(big_states) print(recipe.to_sql()) print(recipe.dataset.csv) This generates the following results. .. code:: SELECT census.state AS state, sum(census.pop2000) AS population FROM census GROUP BY census.state HAVING sum(census.pop2000) > 15000000 ORDER BY sum(census.pop2000) DESC state,population,state_id California,33829442,California Texas,20830810,Texas New York,18978668,New York Florida,15976093,Florida