Using a chat bot
Now that all of the resources in the Ray Serve cluster have been configured, it's time to run inference against the Mistral-7B-Instruct-v0.3 model through a chat bot. The web interface is powered by the Gradio package.
We'll deploy the application with the following Kubernetes resources:
apiVersion: v1
kind: Namespace
metadata:
  name: gradio-mistral-trn1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio-deployment
  namespace: gradio-mistral-trn1
  labels:
    app: gradio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
        - name: gradio
          image: public.ecr.aws/aws-containers/eks-workshop/gradio-web-app-base:0.1.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 7860
          resources:
            requests:
              cpu: "512m"
              memory: "4096Mi"
            limits:
              cpu: "1"
              memory: "4096Mi"
          env:
            - name: MODEL_ENDPOINT
              value: "/infer"
            - name: SERVICE_NAME
              value: "http://mistral-serve-svc.mistral.svc.cluster.local:8000"
          volumeMounts:
            - name: gradio-app-script
              mountPath: /app/gradio-app.py
              subPath: gradio-app-mistral-tran1.py
      volumes:
        - name: gradio-app-script
          configMap:
            name: gradio-app-script
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
  namespace: gradio-mistral-trn1
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  selector:
    app: gradio
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 7860
  type: LoadBalancer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gradio-app-script
  namespace: gradio-mistral-trn1
data:
  gradio-app-mistral-tran1.py: |
    import gradio as gr
    import requests
    import os

    # Endpoint path and service URL, read from the environment
    # (set by the Deployment above) with local defaults
    model_endpoint = os.environ.get("MODEL_ENDPOINT", "/infer")
    service_name = os.environ.get("SERVICE_NAME", "http://localhost:8000")

    # Function to generate text
    def text_generation(message, history):
        prompt = message

        # Create the URL for the inference request
        url = f"{service_name}{model_endpoint}"

        try:
            # Send the request to the model service
            response = requests.get(url, params={"sentence": prompt}, timeout=180)
            response.raise_for_status()  # Raise an exception for HTTP errors

            full_output = response.json()[0]
            # Remove the original question from the output
            answer_only = full_output.replace(prompt, "", 1).strip('["]?\n')

            # Safety filter to remove harmful or inappropriate content
            answer_only = filter_harmful_content(answer_only)
            return answer_only
        except requests.exceptions.RequestException as e:
            # Handle any request exceptions (e.g., connection errors)
            return f"AI: Error: {str(e)}"

    # Define the safety filter function (you can implement this as needed)
    def filter_harmful_content(text):
        # TODO: Implement a safety filter to remove any harmful or inappropriate content
        # For now, simply return the text as-is
        return text

    # Define the Gradio ChatInterface
    chat_interface = gr.ChatInterface(
        text_generation,
        chatbot=gr.Chatbot(line_breaks=True),
        textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
        title="neuron-mistral7bv0.3 AI Chat",
        description="Ask me any question",
        theme="soft",
        examples=["How many languages are in India", "What is Generative AI?"],
        cache_examples=False,
        retry_btn=None,
        undo_btn="Delete Previous",
        clear_btn="Clear",
    )

    # Launch the ChatInterface
    chat_interface.launch(server_name="0.0.0.0")
The components consist of a Deployment, Service, and ConfigMap to launch the application. In particular, the Service component is named gradio-service and is deployed as a LoadBalancer.
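Apply the manifests to create the resources. The file name below is a placeholder; substitute the path where you saved the YAML above:

$ kubectl apply -f gradio-mistral.yaml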
namespace/gradio-mistral-trn1 created
configmap/gradio-app-script created
service/gradio-service created
deployment.apps/gradio-deployment created
To check the status of each component, run the following commands:
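First, verify the Deployment (the namespace and resource names come from the manifests above):

$ kubectl get deployments -n gradio-mistral-trn1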
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
gradio-deployment   1/1     1            1           95s
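Next, verify the ConfigMap that carries the application script:

$ kubectl get configmaps -n gradio-mistral-trn1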
NAME                DATA   AGE
gradio-app-script   1      110s
kube-root-ca.crt    1      111s
Once the load balancer has finished deploying, use the external hostname shown in the EXTERNAL-IP column to directly access the website:
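$ kubectl get services -n gradio-mistral-trn1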
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP                                                                      PORT(S)        AGE
gradio-service   LoadBalancer   172.20.84.26   k8s-gradioll-gradiose-a6d0b586ce-06885d584b38b400.elb.us-west-2.amazonaws.com   80:30802/TCP   8m42s
To wait until the Network Load Balancer has finished provisioning, run the following command:
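A simple polling loop is one option here; this sketch assumes only kubectl and standard shell tools:

$ until kubectl get service gradio-service -n gradio-mistral-trn1 \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' | grep -q .; do
    echo "Waiting for the Network Load Balancer to finish provisioning..."
    sleep 5
  done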
Now that the application is exposed to the outside world, access it by pasting the URL into your web browser. You will see the chat bot powered by the Mistral-7B-Instruct-v0.3 model and can interact with it by asking questions.
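If the chat bot does not respond, you can check connectivity to the model service from inside the cluster, bypassing the web UI. This is a hypothetical spot check that assumes the /infer route accepts a sentence query parameter, as in the Gradio script above:

$ kubectl run curl-test -n gradio-mistral-trn1 --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -s "http://mistral-serve-svc.mistral.svc.cluster.local:8000/infer?sentence=Hello"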

This concludes the current lab on deploying the Mistral-7B-Instruct-v0.3 model on an EKS cluster and interacting with it through a simple chat bot interface.