How to create an API for cleaning bad content using NLP
Natural Language Processing (NLP) is a complex field that involves processing large amounts of data to understand human language. Hugging Face is a popular open-source platform that provides a wide range of NLP models and tools. In this article, we will learn how to use a transformers model from Hugging Face and FastAPI to create a stop words system that can be integrated into other software.
Step 1: Installation
Before we begin, we need to install transformers (together with a backend such as PyTorch), FastAPI, and the Uvicorn server that will run the application. Open the command prompt and run the following command:

pip install transformers torch fastapi uvicorn
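Since the Dockerfile later in this article installs dependencies from a requirements.txt file, it helps to list them in one place now. A minimal sketch (left unpinned for simplicity; in a real project you would pin versions):

```text
# requirements.txt -- illustrative sketch, versions intentionally unpinned
transformers
torch
fastapi
uvicorn
```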
Step 2: Importing the transformers model
Once we have installed transformers and FastAPI, we can start by importing the transformers model. In this example, we will use the DistilBERT model, which is a smaller and faster version of the popular BERT model.
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Note: distilbert-base-uncased ships without a fine-tuned token-classification
# head, so its NER predictions will be unreliable out of the box. In practice,
# load a checkpoint that has been fine-tuned for token classification.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
In this code snippet, we import the pipeline function from transformers and load the DistilBERT model and tokenizer. We then create a pipeline for the “ner” task, which stands for named entity recognition. The model assigns the label “O” to every token that is not part of a named entity; since stop words are common words that carry little meaning on their own, they fall into this “O” class, which is what our filtering will rely on.
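To make the “O” discussion concrete, here is the shape of the list the “ner” pipeline returns. The values below are hypothetical stand-ins, not real model output:

```python
# Hypothetical result of nlp("Alice lives in Paris"): the pipeline returns
# one dict per token it tagged as part of a named entity. Tokens labelled
# "O" are omitted from the list entirely, so a stop word such as "the"
# would yield an empty list instead.
sample = [
    {"entity": "B-PER", "score": 0.998, "word": "alice", "start": 0, "end": 5},
    {"entity": "B-LOC", "score": 0.997, "word": "paris", "start": 15, "end": 20},
]

entity_labels = [item["entity"] for item in sample]
print(entity_labels)  # -> ['B-PER', 'B-LOC']
```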
Step 3: Creating a FastAPI endpoint for the stop words system
Now that we have loaded the transformers model, we can use FastAPI to create an API endpoint for the stop words system. In this example, we will create an endpoint that takes a list of words as input and returns the same list with stop words removed.
from typing import List

from fastapi import FastAPI

app = FastAPI()

@app.post("/remove_stop_words")
def remove_stop_words(words: List[str]):
    cleaned_words = []
    for word in words:
        # The pipeline returns only tokens tagged as entities; tokens
        # labelled "O" are omitted, so an empty result means a stop word.
        results = nlp(word)
        if results:
            cleaned_words.append(word)
    return cleaned_words
In this code snippet, we create a FastAPI instance and define a POST endpoint called “remove_stop_words.” This endpoint takes a list of words as input and returns the same list with stop words removed. We loop through each word in the input list and pass it through the transformers pipeline. The pipeline returns only the tokens it tagged as part of a named entity; every other token is labelled “O” and omitted from the result. A word with an empty result is therefore treated as a stop word and dropped, while any word tagged as an entity is appended to the “cleaned_words” list. Finally, we return the “cleaned_words” list as the output of the endpoint.
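The filtering logic can be exercised without downloading a model by stubbing the pipeline. The fake_ner function below is a hypothetical stand-in that tags only “fox” as an entity, purely to illustrate how the empty-result check behaves:

```python
# Hypothetical stand-in for the transformers "ner" pipeline: it returns
# entity dicts for words it recognises and an empty list for everything else.
def fake_ner(word):
    entities = {"fox": [{"entity": "B-MISC", "word": "fox"}]}
    return entities.get(word, [])

def remove_stop_words(words, ner):
    # Keep only the words the pipeline tagged as (part of) a named entity.
    return [w for w in words if ner(w)]

result = remove_stop_words(["the", "quick", "fox"], fake_ner)
print(result)  # -> ['fox']
```

Swapping fake_ner for the real pipeline instance gives the endpoint's behaviour.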
Step 4: Testing the endpoint
We can now start the application with “uvicorn main:app” (assuming the code lives in main.py) and test the endpoint using any HTTP client, such as cURL or Postman. In this example, we will use the requests module to send a POST request to the endpoint.
import requests

input_words = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
response = requests.post('http://localhost:8000/remove_stop_words', json=input_words)
print(response.json())
In this code snippet, we define a list of input words and send a POST request to the endpoint. We pass the input words as a JSON payload to the request. The endpoint removes the stop words from the input list and returns the cleaned list as a JSON response. Finally, we print the response to the console.
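The same request can also be issued from the command line with cURL, one of the HTTP clients mentioned above (assuming the server is running locally on port 8000):

```shell
curl -X POST http://localhost:8000/remove_stop_words \
  -H "Content-Type: application/json" \
  -d '["the", "quick", "brown", "fox"]'
```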
In the previous section, we learned how to use a transformers model from Hugging Face and FastAPI to create a stop words system that can be integrated into other software. In this section, we will learn how to containerize the application using Docker.
Step 1: Create a Dockerfile
The first step in containerizing the application is to create a Dockerfile. The Dockerfile is a script that defines the environment and configuration of the container. It includes instructions for building the container image and running the application inside the container.
# Use an official Python runtime as a parent image
FROM python:3.9

# Set the working directory to /app
WORKDIR /app

# Copy the requirements file into the container and install the dependencies
COPY requirements.txt /app
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY . /app

# Expose the port on which the application will listen
EXPOSE 8000

# Start the application inside the container
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
In this Dockerfile, we use an official Python runtime as the base image. We set the working directory to “/app” and copy the requirements file into the container. We then install the dependencies using pip. Next, we copy the rest of the application code into the container. We expose port 8000 on the container and set the command to start the application using the Uvicorn server.
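Because the Dockerfile copies the entire build context with “COPY . /app”, a .dockerignore file keeps caches, virtual environments, and version-control data out of the image. A minimal sketch:

```text
# .dockerignore -- illustrative sketch
__pycache__/
*.pyc
.venv/
.git/
```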
Step 2: Build the Docker image
Once we have created the Dockerfile, we can use the “docker build” command to build the Docker image. Open the command prompt, navigate to the directory containing the Dockerfile, and run the following command:
docker build -t stop-words .
In this command, we specify the name of the Docker image using the “-t” flag. The “.” at the end specifies that the build context is the current directory.
Step 3: Run the Docker container
Once we have built the Docker image, we can use the “docker run” command to run the Docker container. We specify the port mapping using the “-p” flag and the name of the container using the “--name” flag. We can also use the “-d” flag to run the container in detached mode.
docker run -p 8000:8000 --name stop-words -d stop-words
In this command, we map port 8000 on the host machine to port 8000 inside the container using the “-p” flag, name the container with the “--name” flag, and run it in detached mode with the “-d” flag. The final argument, “stop-words”, is the name of the Docker image to run.
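Once the container is up, the standard Docker commands can be used to confirm it is running, follow the Uvicorn logs, and stop it when finished:

```shell
docker ps --filter name=stop-words   # confirm the container is running
docker logs -f stop-words            # follow the Uvicorn startup logs
docker stop stop-words               # stop the container when finished
```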