SQLCell Part I: The Basics

11 minute read

* [Description](#Description)
* [Installation](#Installation)
* [Buttons](#Buttons)
* [Declaring Engines](#Declaring-Engines)
* [Passing variables from Python to SQL](#Passing-variables-from-Python-to-SQL)
* [Pass output from SQL to Python](#Pass-output-from-SQL-to-Python)
* [Query plan as color coded table or graph](#Query-plan-as-color-coded-table-or-graph)
## Description
This project came about after I got tired of importing SQLAlchemy every time I wanted to run a query. You should just be able to query your database directly from Jupyter, right? That's what this project accomplishes, along with many other features that make it more like a SQL GUI than a Jupyter plugin.
First, we need to pull down a copy of the project from [here](https://github.com/tmthyjames/SQLCell).

## Installation

Just `git clone` the repo and `cp` the `sqlcell_app.py` file to Jupyter's startup directory (on my computer, the directory is `~/.ipython/profile_default/startup`, but may be different depending on your OS and version of IPython/Jupyter) like so:

$ cd .ipython/profile_default/startup # your start up directory here
$ git clone https://github.com/tmthyjames/SQLCell.git
$ cp SQLCell/sqlcell_app.py sqlcell_app.py # cp sqlcell_app.py to the startup folder so it will get executed
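
If you're not sure where your startup directory lives, IPython can tell you: the `startup` folder sits inside the profile directory that the `locate` subcommand prints.

$ ipython locate profile default # prints something like /Users/you/.ipython/profile_default
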
Now, just use the `ENGINE` parameter along with your connection string and give it a go.

In [5]:
%%sql ENGINE='postgresql://tdobbins:tdobbins@localhost:5432/sports'
select * from nba limit 5

To execute: 0.01 sec | Rows: 5 | DB: sports | Host: localhost | READ MODE

| # | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-10-27 | DET | ATL | 106 | 37 | 96 | 20 | 26 | 12 | 29 | 23 | 36 | 23 | 3 | 15 | 5 | 15 |
| 2 | 2015-10-27 | ATL | DET | 94 | 37 | 82 | 12 | 15 | 8 | 27 | 7 | 33 | 22 | 4 | 25 | 9 | 15 |
| 3 | 2015-10-27 | CLE | CHI | 95 | 38 | 94 | 10 | 17 | 9 | 29 | 11 | 39 | 26 | 7 | 21 | 5 | 11 |
| 4 | 2015-10-27 | CHI | CLE | 97 | 37 | 87 | 16 | 23 | 7 | 19 | 7 | 40 | 13 | 10 | 22 | 6 | 13 |
| 5 | 2015-10-27 | NO | GS | 95 | 35 | 83 | 19 | 27 | 6 | 18 | 8 | 25 | 21 | 3 | 26 | 9 | 19 |

## Buttons

As you can see, this outputs a few things. You get your data for one, but you also get access to a lot of other functionality. For instance, the button group on the left allows you to:
<br/>
&emsp;&emsp;<button title="Explain Analyze Graph" type="button" class="btn btn-info btn-sm button-sect"><p class="fa fa-code-fork fa-rotate-270"></p></button> Run `EXPLAIN ANALYZE` query and return D3.js sankey chart for analysis.
&emsp;&emsp;<button title="Explain Analyze" type="button" class="btn btn-info btn-sm button-sect"><p class="fa fa-info-circle"></p></button> Run `EXPLAIN ANALYZE` query and return color-coded table for analysis.
&emsp;&emsp;<button type="button" title="Execute" class="btn btn-success btn-sm button-sect"><p class="fa fa-play"></p></button> Execute your query.
&emsp;&emsp;<button type="button" title="Execute and Return Data as Variable" class="btn btn-success btn-sm button-sect"><p class="">var</p></button> Save the output of your query to a pandas dataframe.
&emsp;&emsp;<button title="Save" class="btn btn-success btn-sm button-sect" type="button"><p class="fa fa-save"></p></button> Save the results of your query to a TSV.
&emsp;&emsp;<button title="Cancel Query" class="button-sect btn btn-danger btn-sm" type="button"><p class="fa fa-stop"></p></button> Halt your query.

The button on the right is an execute button. Instead of using `shift + enter` to run your query, you can just press that button. This turns into a group of buttons when you declare more than one engine. More on this in the next section.

## Declaring Engines

You can save the connection string with the `declare_engines` flag. This will allow you to run queries without specifying the `ENGINE` parameter as in our first query.

In [6]:
%%sql --declare_engines new
LOCAL=postgresql://tdobbins:tdobbins@localhost:5432/
new engines created
To declare multiple engines, you can use the `append` argument, or you can specify multiple in your initial engine declaration. Let's use the `append` argument here. I have two other Postgres instances to declare, one on RDS and one on EC2. The left side of each assignment is what the button text will be; the right side is the connection string.

In [8]:
%%sql --declare_engines append
RDS=postgresql://username:password@orionpax.cvb2tnvqksy5.us-east-1.rds.amazonaws.com:5432/
EC2=postgresql://username:password@ec2-34-204-5-97.compute-1.amazonaws.com:5432/
new engines created
Now you can easily switch between different connection engines by pushing the engine's button.

## Passing variables from Python to SQL

This is my favorite part of SQLCell. The ability to pass variables from Python to SQL makes dynamic queries easier to write, which is especially handy for web developers using a wrapper such as SQLAlchemy or something similar. Let's test it out.

In [10]:
# in this cell, define your python variables
some_int = 30
some_float = .3
some_str = 'GS'
some_tuple = ('WAS', 'MIN')
Now that we've defined our Python variables, we can construct a dynamic query.

In [11]:
%%sql DB=sports
SELECT * 
FROM nba 
WHERE pts > %(some_int)s
    AND fg/fg_att::FLOAT > %(some_float)s
    AND team = %(some_str)s
    AND opp IN %(some_tuple)s

To execute: 0.027 sec | Rows: 6 | DB: sports | Host: localhost | READ MODE

| # | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-11-12 | GS | MIN | 129 | 48 | 89 | 15 | 16 | 18 | 38 | 6 | 27 | 34 | 7 | 25 | 8 | 11 |
| 2 | 2016-02-03 | GS | WAS | 134 | 49 | 93 | 16 | 29 | 20 | 42 | 11 | 38 | 35 | 5 | 24 | 10 | 18 |
| 3 | 2016-03-21 | GS | MIN | 109 | 43 | 87 | 15 | 24 | 8 | 24 | 13 | 31 | 31 | 10 | 20 | 13 | 17 |
| 4 | 2016-03-29 | GS | WAS | 102 | 38 | 87 | 15 | 23 | 11 | 25 | 7 | 48 | 27 | 6 | 23 | 11 | 16 |
| 5 | 2016-04-05 | GS | MIN | 117 | 49 | 98 | 7 | 8 | 12 | 35 | 9 | 37 | 35 | 5 | 29 | 9 | 24 |
| 6 | 2016-11-26 | GS | MIN | 115 | 45 | 81 | 14 | 17 | 11 | 22 | 12 | 33 | 25 | 8 | 14 | 9 | 18 |

You can also use the colon syntax. Some may find this cleaner and more readable.

In [12]:
%%sql DB=sports
SELECT * 
FROM nba 
WHERE pts > :some_int
    AND fg/fg_att::FLOAT > :some_float
    AND team = :some_str
    AND opp IN :some_tuple

To execute: 0.014 sec | Rows: 6 | DB: sports | Host: localhost | READ MODE

| # | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-11-12 | GS | MIN | 129 | 48 | 89 | 15 | 16 | 18 | 38 | 6 | 27 | 34 | 7 | 25 | 8 | 11 |
| 2 | 2016-02-03 | GS | WAS | 134 | 49 | 93 | 16 | 29 | 20 | 42 | 11 | 38 | 35 | 5 | 24 | 10 | 18 |
| 3 | 2016-03-21 | GS | MIN | 109 | 43 | 87 | 15 | 24 | 8 | 24 | 13 | 31 | 31 | 10 | 20 | 13 | 17 |
| 4 | 2016-03-29 | GS | WAS | 102 | 38 | 87 | 15 | 23 | 11 | 25 | 7 | 48 | 27 | 6 | 23 | 11 | 16 |
| 5 | 2016-04-05 | GS | MIN | 117 | 49 | 98 | 7 | 8 | 12 | 35 | 9 | 37 | 35 | 5 | 29 | 9 | 24 |
| 6 | 2016-11-26 | GS | MIN | 115 | 45 | 81 | 14 | 17 | 11 | 22 | 12 | 33 | 25 | 8 | 14 | 9 | 18 |

## Pass output from SQL to Python

To return the output as a Python variable, use the `var` button, or you can use the `MAKE_GLOBAL` parameter. Let's do it using the parameter.

In [13]:
%%sql DB=sports MAKE_GLOBAL=df
SELECT * 
FROM nba 
WHERE pts > :some_int
    AND fg/fg_att::FLOAT > :some_float
    AND team = :some_str
    AND opp IN :some_tuple

To execute: 0.013 sec | Rows: 6 | DB: sports | Host: localhost | READ MODE

| # | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-11-12 | GS | MIN | 129 | 48 | 89 | 15 | 16 | 18 | 38 | 6 | 27 | 34 | 7 | 25 | 8 | 11 |
| 2 | 2016-02-03 | GS | WAS | 134 | 49 | 93 | 16 | 29 | 20 | 42 | 11 | 38 | 35 | 5 | 24 | 10 | 18 |
| 3 | 2016-03-21 | GS | MIN | 109 | 43 | 87 | 15 | 24 | 8 | 24 | 13 | 31 | 31 | 10 | 20 | 13 | 17 |
| 4 | 2016-03-29 | GS | WAS | 102 | 38 | 87 | 15 | 23 | 11 | 25 | 7 | 48 | 27 | 6 | 23 | 11 | 16 |
| 5 | 2016-04-05 | GS | MIN | 117 | 49 | 98 | 7 | 8 | 12 | 35 | 9 | 37 | 35 | 5 | 29 | 9 | 24 |
| 6 | 2016-11-26 | GS | MIN | 115 | 45 | 81 | 14 | 17 | 11 | 22 | 12 | 33 | 25 | 8 | 14 | 9 | 18 |

Now, `df` contains the dataframe that gets returned from the query.

In [14]:
df
Out[14]:
| | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 0 | 2015-11-12 | GS | MIN | 129 | 48 | 89 | 15 | 16 | 18 | 38 | 6 | 27 | 34 | 7 | 25 | 8 | 11 |
| 1 | 2016-02-03 | GS | WAS | 134 | 49 | 93 | 16 | 29 | 20 | 42 | 11 | 38 | 35 | 5 | 24 | 10 | 18 |
| 2 | 2016-03-21 | GS | MIN | 109 | 43 | 87 | 15 | 24 | 8 | 24 | 13 | 31 | 31 | 10 | 20 | 13 | 17 |
| 3 | 2016-03-29 | GS | WAS | 102 | 38 | 87 | 15 | 23 | 11 | 25 | 7 | 48 | 27 | 6 | 23 | 11 | 16 |
| 4 | 2016-04-05 | GS | MIN | 117 | 49 | 98 | 7 | 8 | 12 | 35 | 9 | 37 | 35 | 5 | 29 | 9 | 24 |
| 5 | 2016-11-26 | GS | MIN | 115 | 45 | 81 | 14 | 17 | 11 | 22 | 12 | 33 | 25 | 8 | 14 | 9 | 18 |
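
Since `df` is an ordinary pandas DataFrame, the full pandas API applies from here on. A couple of quick, purely illustrative aggregations:

df['pts'].mean() # average points across the six games
df.groupby('opp')['pts'].max() # highest point total against each opponent
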
You can also pass the output of the query to Python as the raw result set using the `RAW` parameter.

In [15]:
%%sql DB=sports MAKE_GLOBAL=df RAW=True
SELECT * 
FROM nba 
WHERE pts > :some_int
    AND fg/fg_att::FLOAT > :some_float
    AND team = :some_str
    AND opp IN :some_tuple

To execute: 0.014 sec | Rows: 6 | DB: sports | Host: localhost | READ MODE

| # | dateof | team | opp | pts | fg | fg_att | ft | ft_att | fg3 | fg3_att | off_rebounds | def_rebounds | asst | blks | fouls | stls | turnovers |
|---|--------|------|-----|-----|----|--------|----|--------|-----|---------|--------------|--------------|------|------|-------|------|-----------|
| 1 | 2015-11-12 | GS | MIN | 129 | 48 | 89 | 15 | 16 | 18 | 38 | 6 | 27 | 34 | 7 | 25 | 8 | 11 |
| 2 | 2016-02-03 | GS | WAS | 134 | 49 | 93 | 16 | 29 | 20 | 42 | 11 | 38 | 35 | 5 | 24 | 10 | 18 |
| 3 | 2016-03-21 | GS | MIN | 109 | 43 | 87 | 15 | 24 | 8 | 24 | 13 | 31 | 31 | 10 | 20 | 13 | 17 |
| 4 | 2016-03-29 | GS | WAS | 102 | 38 | 87 | 15 | 23 | 11 | 25 | 7 | 48 | 27 | 6 | 23 | 11 | 16 |
| 5 | 2016-04-05 | GS | MIN | 117 | 49 | 98 | 7 | 8 | 12 | 35 | 9 | 37 | 35 | 5 | 29 | 9 | 24 |
| 6 | 2016-11-26 | GS | MIN | 115 | 45 | 81 | 14 | 17 | 11 | 22 | 12 | 33 | 25 | 8 | 14 | 9 | 18 |

Now, `df` is the raw result set returned by SQLAlchemy.

In [16]:
df
Out[16]:
[(datetime.date(2015, 11, 12), u'GS', u'MIN', 129L, 48, 89, 15, 16, 18, 38, 6, 27, 34, 7, 25, 8, 11),
 (datetime.date(2016, 2, 3), u'GS', u'WAS', 134L, 49, 93, 16, 29, 20, 42, 11, 38, 35, 5, 24, 10, 18),
 (datetime.date(2016, 3, 21), u'GS', u'MIN', 109L, 43, 87, 15, 24, 8, 24, 13, 31, 31, 10, 20, 13, 17),
 (datetime.date(2016, 3, 29), u'GS', u'WAS', 102L, 38, 87, 15, 23, 11, 25, 7, 48, 27, 6, 23, 11, 16),
 (datetime.date(2016, 4, 5), u'GS', u'MIN', 117L, 49, 98, 7, 8, 12, 35, 9, 37, 35, 5, 29, 9, 24),
 (datetime.date(2016, 11, 26), u'GS', u'MIN', 115L, 45, 81, 14, 17, 11, 22, 12, 33, 25, 8, 14, 9, 18)]
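
If you need a DataFrame again later, pandas will happily take this list of tuples; you just have to supply the column names yourself. A minimal sketch, using the column order shown above (`df_from_raw` is just an illustrative name):

import pandas as pd

# the raw rows carry no header, so the column names come from the table shown earlier
columns = ['dateof', 'team', 'opp', 'pts', 'fg', 'fg_att', 'ft', 'ft_att',
           'fg3', 'fg3_att', 'off_rebounds', 'def_rebounds', 'asst', 'blks',
           'fouls', 'stls', 'turnovers']
df_from_raw = pd.DataFrame(df, columns=columns)  # df holds the raw tuples here
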
## Query plan as color coded table or graph

This feature helps point out slow spots in your queries. First, let's look at the query plan table. At the moment, it only works with PostgreSQL.

In [19]:
%%sql DB=econ
select average_family_size,
    average_home_value,
    average_household_income,
    average_household_size,
    county
from block_data bd
join block_shapes bs on bs.id = bd.block_code 

To execute: 0.011 sec | Rows: 8 | DB: econ | Host: localhost | READ MODE

QUERY PLAN
Hash Join  (cost=33.60..89.81 rows=472 width=48) (actual time=0.499..0.977 rows=472 loops=1)
  Hash Cond: (bs.id = bd.block_code)
  ->  Seq Scan on block_shapes bs  (cost=0.00..49.72 rows=472 width=8) (actual time=0.007..0.182 rows=472 loops=1)
  ->  Hash  (cost=27.71..27.71 rows=471 width=56) (actual time=0.470..0.470 rows=471 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 49kB
        ->  Seq Scan on block_data bd  (cost=0.00..27.71 rows=471 width=56) (actual time=0.008..0.352 rows=471 loops=1)
Planning time: 0.192 ms
Execution time: 1.054 ms
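
Under the hood, this color-coded table is PostgreSQL's `EXPLAIN ANALYZE` output with SQLCell's styling layered on top. If you want the raw plan, you should be able to run the same thing yourself in a plain `%%sql` cell (or any SQL client):

%%sql DB=econ
EXPLAIN ANALYZE
select average_family_size,
    average_home_value,
    average_household_income,
    average_household_size,
    county
from block_data bd
join block_shapes bs on bs.id = bd.block_code
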
The darker the color, the slower the node. Here, we can see that most of the time was spent in a hash join that returned 472 rows. Let's look at the query plan graph, built with D3.js. It's a sankey graph that shows the hierarchy of the nodes and their respective cost in terms of time and rows.

In [1]:
%%sql DB=econ
SELECT *, 
    CASE 
        WHEN num_of_sb > 0 THEN true
        ELSE false
    end AS has_starbucks
FROM (
    SELECT
        ST_Area(geom::geography) / 1609.34^2 AS sq_miles,
        "HC03_VC96",
        "HC01_VC03",
        "HC01_VC103",
        "HC03_VC67",
        "HC03_VC68",
        "HC03_VC88",
        "HC03_VC49",
        "HC01_VC88",
        "HC03_VC135",
        "HC03_VC136",
        "HC03_VC92",
        z."HC03_VC12",
        z."HC03_VC41",
        "HC03_VC43",
        "HC03_VC161",
        "HC01_VC118",
        "HC03_VC87",
        z."HC03_VC171",
        zn.*,
        z.starbucks_count,
        sqrt("HC01_VC03") as "HC01_VC03_sqrt",
        count(sb."zipcode") over (
            partition by sb.zipcode
        ) as num_of_sb,
        sb."City",
        sb."State",
        sb."Latitude",
        sb."Longitude"
    from zip_data_sb z
    join zip_data2 z2 on z2."Id" = z."Id"
    join zip_data3 z3 on z3."Id" = z."Id"
    left join sb_stores sb on sb."zipcode" = z."Id2"
    join cb_2015_us_zcta510_500k c on c.geoid10::bigint = z."Id2"
    join (
        select zip,
            sum(case when naics ~ '^445' then est end)::bigint as "grocery_stores",
            sum(case when naics ~ '^722' then est end)::bigint as "restaurants",
            sum(case when naics ~ '^311' then est end)::bigint as "manufacturing",
            sum(case when naics ~ '^54171' or naics = '518210' then est end)::bigint as "tech",
            sum(case when naics ~ '^611' then est end)::bigint as "colleges",
            sum(case when naics = '------' then est end)::bigint as "total",
            sum(case when naics = '531120' then est end)::bigint as "shopping_centers",
            sum(case when naics ~ '^23611' then est end)::bigint as "new_homes",
            sum(case when naics ~ '^5311' then est end)::bigint as "hotels",
            sum(case when naics ~ '^4411' then est end)::bigint as "car_dealers",
            sum(case when naics ~ '^4853' then est end)::bigint as "taxis",
            sum(case when naics = '481111' 
                or naics in ('485999', '488119', '488190') then est end)::bigint as "airport_related"
        from zip_to_naics 
        group by zip
    ) zn on zn.zip = z."Id2"
) s

To execute: 48.948 sec | Rows: 1 | DB: econ | Host: localhost

*(Interactive D3.js sankey chart of the query plan. Each node is one plan operation (Seq Scan, Hash, Hash Join, Aggregate, Sort, WindowAgg, Subquery Scan), with tooltips giving its total cost, timing, and parent/child row counts. The total plan cost is about 1,107,480, and most of the roughly 49 seconds is spent in the aggregate over zip_to_naics.)*