Exposing a Word2Vec Model with a RESTful API Using Only a Jupyter Notebook (No Web-Development Skills Required!)

9 minute read | Updated:

* [Installation and Configuration](#Installation-and-Configuration)
* [Define the API](#Define-the-API)
* [Startup the API](#Startup-the-API)
* [Test the API](#Test-the-API)
* [Shutdown the API](#Shutdown-the-API)

In this post, we'll use a Jupyter notebook as a backend RESTful service to expose a Word2Vec model we trained previously in my write-up on <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">Analyzing Rap Lyrics Using Word Vectors</a>. Typically, when building a RESTful API to expose a model, I'd use <a href="https://flask-restful.readthedocs.io/en/latest/">Flask-RESTful</a> or a paid service like <a href="https://www.alteryx.com/products/alteryx-platform/alteryx-promote">Alteryx Promote</a>. The former is great if you have web-development experience, and the latter is great if you have money to spend. For those who don't want to buy a license or learn web development, you can just use Jupyter!

I love Jupyter so much that I go out of my way to do tasks in Jupyter that I wouldn't normally do in Jupyter, which is why I built <a href="https://github.com/tmthyjames/SQLCell">SQLCell</a>; I absolutely hated using pgAdmin or any other SQL client.

Admittedly, after reading <a href="http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/">this post</a> a while back, I tried to use Jupyter as a RESTful service but just couldn't get it to work correctly. Recently at work, though, I needed to expose an LDA topic model via a RESTful API and thought I'd give it another try. So let's get started.


## Installation and Configuration


First, you'll need to `pip3 install` the `jupyter_kernel_gateway` package:


```
pip3 install jupyter_kernel_gateway
```
Then generate the config file:


```
jupyter kernelgateway --generate-config
```
If you want to access this service on another computer, then you'll need to open the kernel's config file (`~/.jupyter/jupyter_kernel_gateway_config.py`) and change this line:


```python
#c.KernelGatewayApp.ip = '127.0.0.1'
```

to

```python
c.KernelGatewayApp.ip = '*'
```
Hopefully, everything went smoothly for you. I'm on an Ubuntu system; if you're on Windows or Mac, you may well run into platform-specific errors.

Next, we'll define our RESTful API endpoints.


## Define the API


First, define any variables that you want to be global to all endpoints (import statements, models, etc.) as you normally would in your Jupyter notebook. In our example, we have a Doc2Vec model that we trained in a <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> that we'll use to score new queries and return similarity scores.


In [67]:

```python
import json
import gensim

# the saved model is a Doc2Vec model, so load it with the matching class
model = gensim.models.Doc2Vec.load('rap-lyrics.doc2vec')
```
So here's the magic. When you start up this service, the kernel gateway will look for a hash (`#`) followed by an HTTP verb then, lastly, an endpoint (e.g. `/some-dummy-endpoint`). In the following example, the kernel gateway will create an endpoint called `/most_similar_terms` that accepts `GET` requests. You can also see we're allowing the option for the user to send data using the URL parameters, `query` and `topn`. `query` expects a word to measure similarity against, and `topn` tells our endpoint how many results we want back.


In [ ]:

```python
# GET /most_similar_terms
req = json.loads(REQUEST)
args = req['args']
topn = 10 if 'topn' not in args else int(args['topn'][0])
if 'query' in args:
    query = args['query'][0]  # URL parameter values arrive as lists
    print(json.dumps({'results': model.most_similar(query, topn=topn)}))
else:
    print(json.dumps({'results': None}))
```
The `REQUEST` variable is the HTTP request data. It contains all the information that we'll send to our endpoint, as well as other data (such as headers). If you run this, it will error out as `REQUEST` is not defined. The kernel gateway defines this variable when you send a request to the endpoint after the service has been started.
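To make that concrete, here's a sketch of what the parsed `REQUEST` might contain for a call like `GET /most_similar_terms?query=wax&topn=5` (all values here are illustrative, not captured from a real request):

```python
import json

# Hypothetical serialized REQUEST for GET /most_similar_terms?query=wax&topn=5;
# the kernel gateway injects a JSON string along these lines before running the cell
REQUEST = json.dumps({
    'body': '',                                 # raw request body (empty for a GET)
    'args': {'query': ['wax'], 'topn': ['5']},  # URL parameters; each value is a list
    'path': {},                                 # path variables (none on this route)
    'headers': {'Host': 'localhost:8989'},      # request headers
})

req = json.loads(REQUEST)
print(req['args']['query'][0], int(req['args']['topn'][0]))  # -> wax 5
```

This is also why the endpoint code indexes `args['topn'][0]`: every URL parameter arrives as a list, since a parameter can appear more than once in a query string.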
Let's create another endpoint that returns the most similar songs given a query word:


In [ ]:

```python
# GET /most_similar_songs
req = json.loads(REQUEST)
args = req['args']
topn = 10 if 'topn' not in args else int(args['topn'][0])
if 'query' in args:
    query = args['query'][0]
    print(json.dumps({'results': model.docvecs.most_similar([model[query]], topn=topn)}))
else:
    print(json.dumps({'results': None}))
```
And lastly, let's create a final endpoint that accepts `POST` requests and uses path variables instead of URL parameters to specify `query` and `topn`.


In [ ]:

```python
# POST /most_similar_songs/:query/:topn
req = json.loads(REQUEST)
body = req['body']
query = req['path']['query']
topn = int(req['path']['topn'] or 10)
print(json.dumps({'results': model.docvecs.most_similar([model[query]], topn=topn)}))
```
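For comparison with the URL-parameter version, here's a sketch of the payload this route might receive for `POST /most_similar_songs/rap/10` (values are hypothetical):

```python
import json

# Hypothetical REQUEST for POST /most_similar_songs/rap/10; the gateway fills
# req['path'] from the :query and :topn placeholders in the route annotation
REQUEST = json.dumps({
    'body': '',
    'args': {},
    'path': {'query': 'rap', 'topn': '10'},  # path values arrive as strings
    'headers': {},
})

req = json.loads(REQUEST)
topn = int(req['path']['topn'] or 10)  # same coercion the endpoint performs
print(req['path']['query'], topn)  # -> rap 10
```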
## Startup the API


Now, normally you'd head over to the command line, `cd` into your working directory, then run the following command to start the service, where `KernelGatewayApp.seed_uri` is just the path to your notebook:


```
jupyter kernelgateway \
    --KernelGatewayApp.api='kernel_gateway.notebook_http' \
    --KernelGatewayApp.seed_uri='/home/ubuntu/projects/APIs/word2vec-restful-api.ipynb' \
    --port 8989
```
No problem. It's just as easy to do that, but as I stated before, I like doing as much as possible in Jupyter, and given Jupyter's flexibility, virtually anything is possible. So let's start up our notebook service in the background (by appending `&` to the end of the command). It's important to send this task to the background, or else the cell will never finish running until you interrupt it.


In [108]:

```python
# PUT /STARTUP-DO-NOT-HIT
import os
os.system('''jupyter kernelgateway \
    --KernelGatewayApp.api='kernel_gateway.notebook_http' \
    --KernelGatewayApp.seed_uri='/home/ubuntu/projects/APIs/word2vec-restful-api.ipynb' \
    --port 8989 &''') # use & to send the task to the background
```

Out[108]:

```
0
```
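A return value of `0` only tells you the shell command launched, not that the gateway is actually up. As a sanity check, you can poll for the process; this is a sketch that assumes a Linux-style `pgrep` and matches the same pattern we'll later hand to `pkill`:

```python
import subprocess
import time

time.sleep(2)  # give the gateway a moment to come up

# pgrep exits with 0 if a matching process exists, 1 if it doesn't
check = subprocess.run(['pgrep', '-f', 'jupyter-kernelg'], capture_output=True)
print('gateway running:', check.returncode == 0)
```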
You probably noticed that we just defined an endpoint, `/STARTUP-DO-NOT-HIT`, that accepts `PUT` requests. Unfortunately, that's necessary. When the kernel gateway starts, it executes every code cell that isn't annotated with an endpoint, including your startup, shutdown, and testing cells, which causes errors. Wrapping those cells in dummy endpoints that I never actually call was the best way I found to run everything from the notebook without having to start and stop the service from the command line. I think running everything in the notebook is more convenient, but I don't like having to create dummy endpoints, so if you know a better workaround, please let me know.


## Test the API


Now, let's test our API by using the `requests` package and making requests to our endpoints. 


In [120]:

```python
import requests

# define a DRY helper for calling our GET endpoints
def call_word2vec(endpoint, query, topn):
    r = requests.get(
        'http://ec2-18-207-173-217.compute-1.amazonaws.com:8989/'+endpoint+'?query='+query+'&topn='+str(topn)
    )
    return json.loads(r.text)
```
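Concatenating strings works, but it breaks as soon as a query needs URL encoding. Here's a hypothetical variant (note the different name and the placeholder base URL) that lets `requests` build the query string instead:

```python
import requests

def call_word2vec_params(endpoint, query, topn, base='http://localhost:8989'):
    # requests builds and URL-encodes the query string from the params dict
    r = requests.get(f'{base}/{endpoint}', params={'query': query, 'topn': topn})
    return r.json()

# Inspect the URL this would request, without actually sending anything:
prepared = requests.Request(
    'GET', 'http://localhost:8989/most_similar_terms',
    params={'query': 'wax', 'topn': 10},
).prepare()
print(prepared.url)  # -> http://localhost:8989/most_similar_terms?query=wax&topn=10
```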
Here, we call the `/most_similar_terms` endpoint and return the top 10 words that are most similar to the word "wax":


In [33]:

```python
# PUT /test
call_word2vec('most_similar_terms', 'wax', 10)
```

Out[33]:

```
{'results': [['rappin', 0.32639312744140625],
  ['track', 0.2756131887435913],
  ['rapper', 0.27285802364349365],
  ['snap', 0.2668995261192322],
  ['adapt', 0.2646450996398926],
  ['rolleys', 0.26291176676750183],
  ['peeler', 0.2610893249511719],
  ['crack', 0.26083022356033325],
  ['black', 0.2604678273200989],
  ['gat', 0.25876176357269287],
  ['rat', 0.2560899555683136]]}
```
Here, we call the `/most_similar_songs` endpoint and return the top 10 songs that are most similar to the word "eminem":


In [34]:

```python
# PUT /test
call_word2vec('most_similar_songs', 'eminem', 10)
```

Out[34]:

```
{'results': [['Eminem|2363', 0.26885756850242615],
  ['Eminem|2995', 0.2669536769390106],
  ['D12|5191', 0.25570592284202576],
  ['Eminem|4189', 0.2381049543619156],
  ['D12|3477', 0.2308967411518097],
  ['D12|3481', 0.2302926629781723],
  ['D12|5186', 0.2270514965057373],
  ['Eminem|7985', 0.2205711007118225],
  ['Eminem|5749', 0.21902979910373688],
  ['Fat_Joe|5770', 0.21522411704063416]]}
```
Here, we do the same thing, but instead of using URL parameters, we specify those arguments in our URL path:


In [35]:

```python
# PUT /test
r = requests.post(
    'http://ec2-18-207-173-217.compute-1.amazonaws.com:8989/most_similar_songs/rap/10'
)
json.loads(r.text)
```

Out[35]:

```
{'results': [['B.o.B|7896', 0.2209690660238266],
  ['Young_Jeezy|7836', 0.21050706505775452],
  ['KRS-One|7474', 0.20887047052383423],
  ['Fat_Joe|3611', 0.2056863009929657],
  ['Gang_Starr|2048', 0.19296427071094513],
  ['Tech_N9ne|4517', 0.19260960817337036],
  ['B.o.B|8583', 0.19166532158851624],
  ['De_La_Soul|2944', 0.19135785102844238],
  ['Eminem|4190', 0.19095411896705627],
  ['Gang_Starr|4774', 0.18431484699249268]]}
```
## Shutdown the API


And finally, we shut down the API. Note that if you make changes to your endpoints, you'll need to restart the service for those changes to register.


In [123]:

```python
# PUT /SHUTDOWN-DO-NOT-HIT
os.system('pkill -f "jupyter-kernelg"')
```

Out[123]:

```
15
```