Exposing a Word2Vec Model with a RESTful API Using Only a Jupyter Notebook (No Web-Development Skills Required!)
* [Installation and Configuration](#Installation-and-Configuration)
* [Define the API](#Define-the-API)
* [Startup the API](#Startup-the-API)
* [Test the API](#Test-the-API)
* [Shutdown the API](#Shutdown-the-API)
In this post, we'll use a Jupyter notebook as a backend RESTful service to expose a Word2Vec model we trained previously in my write-up on <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">Analyzing Rap Lyrics Using Word Vectors</a>. Typically, when building a RESTful API to expose a model, I'd use <a href="https://flask-restful.readthedocs.io/en/latest/">Flask-RESTful</a> or a paid service like <a href="https://www.alteryx.com/products/alteryx-platform/alteryx-promote">Alteryx Promote</a>. The former is great if you have web development experience, and the latter is great if you have money to spend. For those who don't want to buy a license or learn web development skills, you can just use Jupyter!
I love Jupyter so much that I go out of my way to do tasks in Jupyter that I wouldn't normally do in Jupyter, which is why I built <a href="https://github.com/tmthyjames/SQLCell">SQLCell</a>, because I absolutely hated using pgAdmin or any other SQL client.
Admittedly, after reading <a href="http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/">this post</a> a while back, I tried to use Jupyter as a RESTful service but I just couldn't get it to work correctly. But recently at work, I needed to expose an LDA topic model via a RESTful API and thought I'd give it another try. So let's get started.
## Installation and Configuration
First, you'll need to `pip3 install` the `jupyter_kernel_gateway` package:
pip3 install jupyter_kernel_gateway
Then generate the config file:
jupyter kernelgateway --generate-config
If you want to access this service on another computer, then you'll need to open the kernel's config file (`~/.jupyter/jupyter_kernel_gateway_config.py`) and change this line:
#c.KernelGatewayApp.ip = '127.0.0.1'
to
c.KernelGatewayApp.ip = '*'
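Incidentally, the options we'll pass on the command line in the startup section below can also live in this same config file if you prefer; here's a minimal sketch, assuming the port and notebook path used later in this post:

```python
# ~/.jupyter/jupyter_kernel_gateway_config.py (excerpt)
c.KernelGatewayApp.ip = '*'        # listen on all interfaces
c.KernelGatewayApp.port = 8989     # same port we pass via --port below
c.KernelGatewayApp.api = 'kernel_gateway.notebook_http'
c.KernelGatewayApp.seed_uri = '/home/ubuntu/projects/APIs/word2vec-restful-api.ipynb'
```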
Hopefully, everything went smoothly for you. I'm on an Ubuntu system, but if you are on Windows or Mac I'm sure you'll run into errors.
Next we'll define our RESTful API endpoints.
## Define the API
First, define any variables that you want to be global to all endpoints (import statements, models, etc) as you normally would in your Jupyter notebook. In our example, we have a Doc2Vec model that we trained in a <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> that we'll use to score new queries and return a similarity score.
import json
import gensim

model = gensim.models.Word2Vec.load('rap-lyrics.doc2vec')
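Before wiring the model into any endpoints, it's worth a quick sanity check in a plain cell to confirm it loaded correctly (this assumes the cell above has been run; "wax" is only an example term, use any word you know is in the vocabulary):

```python
# Quick sanity check that the model loaded correctly; 'wax' is only an
# example term -- substitute any word you know is in the model's vocabulary.
print(model.most_similar('wax', topn=3))
```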
So here's the magic. When you start up this service, the kernel gateway will look for a hash (`#`) followed by an HTTP verb and then an endpoint (e.g. `/some-dummy-endpoint`). In the following example, the kernel gateway will create an endpoint called `/most_similar_terms` that accepts `GET` requests. You can also see we're allowing the option for the user to send data using the URL parameters `query` and `topn`: `query` expects a word to measure similarity against, and `topn` tells our endpoint how many results we want back.
# GET /most_similar_terms
req = json.loads(REQUEST)
args = req['args']
# URL parameter values arrive as lists of strings
topn = 10 if 'topn' not in args else int(args['topn'][0])
if 'query' in args:
    query = args['query']
    print(json.dumps({'results': model.most_similar(query, topn=topn)}))
else:
    print(json.dumps({'results': None}))
The `REQUEST` variable is the HTTP request data. It contains all the information that we'll send to our endpoint, as well as other data (such as headers). If you run this, it will error out as `REQUEST` is not defined. The kernel gateway defines this variable when you send a request to the endpoint after the service has been started.
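To make the shape of that variable concrete, here's a rough, hand-written stand-in for what `json.loads(REQUEST)` gives you for a request like `GET /most_similar_terms?query=wax&topn=5` (illustrative only; the real payload also carries the full set of request headers):

```python
import json

# A hand-written stand-in for the payload the gateway would inject for
# GET /most_similar_terms?query=wax&topn=5 -- illustrative only.
REQUEST = json.dumps({
    'body': '',
    'args': {'query': ['wax'], 'topn': ['5']},  # URL parameters arrive as lists of strings
    'path': {},                                 # path variables, when the route defines them
    'headers': {'Host': 'localhost:8989'}
})

req = json.loads(REQUEST)
print(req['args']['query'][0])      # wax
print(int(req['args']['topn'][0]))  # 5
```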
Let's create another endpoint that returns the most similar songs given a query word:
# GET /most_similar_songs
req = json.loads(REQUEST)
args = req['args']
topn = 10 if 'topn' not in args else int(args['topn'][0])
if 'query' in args:
    query = args['query'][0]
    print(json.dumps({'results': model.docvecs.most_similar([model[query]], topn=topn)}))
else:
    print(json.dumps({'results': None}))
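The `model.docvecs.most_similar([model[query]], ...)` line is doing two things at once, so here it is unrolled step by step (a sketch that relies on the model loaded earlier, not another cell in the seed notebook; "eminem" is just an example query):

```python
# Roughly what the endpoint above does, unrolled for clarity.
query = 'eminem'
word_vector = model[query]            # the word vector for the query term
similar_songs = model.docvecs.most_similar(
    [word_vector],                    # find the song (document) vectors closest to that word vector
    topn=10
)
print(similar_songs)
```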
And lastly, let's create a final endpoint that accepts `POST` requests and uses path variables instead of URL parameters to specify `query` and `topn`.
# POST /most_similar_songs/:query/:topn
req = json.loads(REQUEST)
body = req['body']
query = req['path']['query']
topn = int(req['path']['topn'] or 10)
print(json.dumps({'results': model.docvecs.most_similar([model[query]], topn=topn)}))
## Startup the API
Now, normally you'd head over to the command line, `cd` into your working directory, then run the following command to start the service, where `KernelGatewayApp.seed_uri` is just the path to your notebook:
jupyter kernelgateway \
--KernelGatewayApp.api='kernel_gateway.notebook_http' \
--KernelGatewayApp.seed_uri='/home/ubuntu/projects/APIs/word2vec-restful-api.ipynb' \
--port 8989
No problem. It's just as easy to do that, but as I stated before, I like doing as much as possible in Jupyter, and given the flexibility of Jupyter, virtually anything is possible. So let's start up our notebook service in the background (by appending `&` to the end of the command). It's important to send this task to the background or else the cell won't ever quit running until you interrupt it.
# PUT /STARTUP-DO-NOT-HIT
import os
os.system('''jupyter kernelgateway \
--KernelGatewayApp.api='kernel_gateway.notebook_http' \
--KernelGatewayApp.seed_uri='/home/ubuntu/projects/APIs/word2vec-restful-api.ipynb' \
--port 8989 &''') # use & to send the task to the background
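Because `os.system` returns as soon as the command is backgrounded, the gateway may need a few seconds before it starts answering requests. Here's a small sketch for waiting on it, assuming the `requests` package is installed and the service runs on port 8989 of the same machine:

```python
import time
import requests

# Poll the gateway until it responds, giving up after roughly 30 seconds.
# Any route will do here; we only care that the HTTP server is listening.
for _ in range(30):
    try:
        requests.get('http://localhost:8989/most_similar_terms?query=wax', timeout=1)
        print('kernel gateway is up')
        break
    except requests.exceptions.ConnectionError:
        time.sleep(1)
```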
You probably noticed that we just defined the endpoint `/STARTUP-DO-NOT-HIT` that accepts `PUT` requests. That is true, unfortunately. When the kernel service starts, it executes every code cell, including your startup, shutdown, or testing cells, which causes errors. Defining a dummy endpoint that I'll never actually call was the best way to ensure I could run everything in the notebook without having to start/stop the service from the command line. I think running everything in the notebook is more convenient, but I don't like having to create a dummy endpoint, so if you know a better workaround please let me know.
## Test the API
Now, let's test our API by using the `requests` package and making requests to our endpoints.
# define a DRY function
import requests

def call_word2vec(endpoint, query, topn):
    r = requests.get(
        'http://ec2-18-207-173-217.compute-1.amazonaws.com:8989/' + endpoint + '?query=' + query + '&topn=' + str(topn)
    )
    return json.loads(r.text)
Here, we call the `/most_similar_terms` endpoint and return the top 10 words that are most similar to the word "wax":
# PUT /test
call_word2vec('most_similar_terms', 'wax', 10)
Here, we call the `/most_similar_songs` endpoint and return the top 10 songs that are most similar to the word "eminem":
# PUT /test
call_word2vec('most_similar_songs', 'eminem', 10)
Here, we do the same thing, but instead of using URL parameters, we specify those arguments in our URL path:
# PUT /test
r = requests.post(
'http://ec2-18-207-173-217.compute-1.amazonaws.com:8989/most_similar_songs/rap/10'
)
json.loads(r.text)
## Shutdown the API
And finally, we shut down the API. Note that if you make changes to your endpoints, you'll need to restart the service for those changes to register.
# PUT /SHUTDOWN-DO-NOT-HIT
os.system('pkill -f "jupyter-kernelg"')
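If you want to double-check that the gateway actually went down, here's a quick sketch, assuming `pgrep` is available (it is on Ubuntu):

```python
import subprocess

# pgrep -f prints the PIDs of matching processes; empty output means the
# gateway is no longer running (pgrep exits non-zero when nothing matches).
result = subprocess.run(['pgrep', '-f', 'jupyter-kernelg'],
                        capture_output=True, text=True)
print(result.stdout or 'no kernel gateway process found')
```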