TWIL: February 12, 2023

This week most of my learning focused on Azure OpenAI Service and how to use GPT-3 models for specific use cases, either by fine-tuning them or by leveraging embeddings. I’m also highlighting two articles on data privacy and content filtering in Azure OpenAI Service, and the awesome Galileo AI. Enjoy!


Architecture

Armchair Architects: Is Big Data Turning into Dark Data?
I think it’s time we talk about data and in particular, I want to talk about this thing I think that people are using to scare children at night when they want them to go to bed or something. They talk about dark data, but I don’t actually know what dark data means. I know what big data means, but apparently there’s this concern that big data is turning into dark data, so can we just get the dark data question out on the table first? What do we mean when we say dark data? Then we can dive into this more fully.


OpenAI: Fine-Tuning

Fine-tuning GPT-3 Using Python to Create a Virtual Mental Health Assistant Bot
Hello again! In my previous article, I outlined the steps required to integrate GPT-3 and Dialogflow, by creating a Virtual Mental Health Assistant. In this one, we will refine the Mental Health Chatbot we created, by learning how to fine-tune our GPT-3 model.
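
As a rough illustration of the data preparation step, the legacy GPT-3 fine-tuning workflow took training data as a JSON Lines file of prompt/completion pairs. The sketch below assumes that format; the example conversations, the " ->" separator, and the `to_jsonl` helper are invented for illustration and are not taken from the article.

```python
import json

# Hypothetical training examples, invented for illustration only.
examples = [
    {
        "prompt": "I have trouble sleeping before exams. ->",
        "completion": " That sounds stressful. Would you like to talk about what worries you most?",
    },
    {
        "prompt": "I feel lonely since I moved to a new city. ->",
        "completion": " Moving can be isolating. What has helped you feel connected in the past?",
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

Each printed line is one training example; saving the output to a file gives the kind of dataset a fine-tuning job would consume.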


OpenAI: Embeddings

Beyond Semantic Search with OpenAI and Pinecone
An interactive and practical workshop on building AI-powered semantic applications on top of OpenAI and Pinecone. We will do an end-to-end tutorial on building semantic search, text summarization, question-answering, entity extraction, and other applications. No ML expertise is needed: If you know how to use a REST API, this is for you.

Introducing Text and Code Embeddings
We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.

Build ChatGPT-like Chatbots With Customized Knowledge for Your Websites, Using Simple Programming
As impressive as ChatGPT is, it would be even cooler if there was a way to integrate it into your own website and train it with customized information. Imagine being able to create a chatbot that is tailored to your business or one that can hold intelligent conversations with your friends and family.

Question Answering using Embeddings
Many use cases require GPT-3 to respond to user questions with insightful answers. For example, a customer support chatbot may need to provide answers to common questions. The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information. In this notebook we will demonstrate a method for enabling GPT-3 to answer questions using a library of text as a reference, by using document embeddings and retrieval. We’ll be using a dataset of Wikipedia articles about the 2020 Summer Olympic Games.
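
One way to picture the retrieval step is sketched below: rank the library passages by similarity to the question's embedding, then place the best matches in the prompt. This is a minimal sketch, not the notebook's code. The 3-dimensional vectors are toy stand-ins for real embeddings, and `normalize`, `top_k`, and `build_prompt` are hypothetical helper names.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, library, k=2):
    """Return the k library texts whose embeddings are most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in library.items()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question, contexts):
    """Put the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
library = {
    "Passage about athletics events": [0.9, 0.1, 0.0],
    "Passage about swimming events": [0.0, 1.0, 0.2],
    "Passage about the opening ceremony": [0.1, 0.0, 0.9],
}
question = "Who won the 100 metres?"
question_vec = [0.8, 0.2, 0.1]  # hypothetical embedding of the question

contexts = top_k(question_vec, library, k=2)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real application the vectors would come from an embeddings endpoint and the finished prompt would be sent to a completion model.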

Learn how to generate embeddings with Azure OpenAI
An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

Zero-shot classification with embeddings
Zero-shot classification predicts a label for a piece of text without any labeled training data; the difference in cosine similarity between the embedding of the text and the embeddings of the candidate labels is used for prediction.
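
A minimal sketch of that scoring rule for two labels follows. It assumes the text and both labels have already been embedded; the 2-dimensional vectors are invented, and `label_score` and `classify` are illustrative names.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_score(text_vec, positive_vec, negative_vec):
    """Similarity to the positive label minus similarity to the negative label."""
    return cosine_similarity(text_vec, positive_vec) - cosine_similarity(text_vec, negative_vec)


def classify(text_vec, positive_vec, negative_vec):
    """Predict 'positive' when the text is closer to the positive label."""
    return "positive" if label_score(text_vec, positive_vec, negative_vec) > 0 else "negative"


# Toy embeddings (assumption): real ones would come from an embeddings model.
positive_vec = [1.0, 0.0]
negative_vec = [0.0, 1.0]
review_vec = [0.7, 0.3]

prediction = classify(review_vec, positive_vec, negative_vec)
print(prediction)
```

No classifier is trained here: the only inputs are the embeddings of the text and of the label descriptions, which is what makes the approach zero-shot.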

Cosine similarity
In data analysis, cosine similarity is a measure of similarity between two sequences of numbers. To define it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle.
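
The definition translates directly into a few lines of code. The sketch below uses plain Python lists and made-up vectors to show that scaling a vector leaves the result unchanged.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [1.0, 1.0]
a_scaled = [10.0 * x for x in a]  # same direction as a, ten times longer

print(cosine_similarity(a, b))         # cosine of 45 degrees, about 0.7071
print(cosine_similarity(a_scaled, b))  # same value: only the angle matters
```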


OpenAI: Compliance

Data, privacy, and security for Azure OpenAI Service
This article provides details regarding how data provided by you to the Azure OpenAI service is processed, used, and stored. Azure OpenAI stores and processes data to provide the service, monitor for abusive use, and develop and improve the quality of Azure’s Responsible AI systems. Please also see the Microsoft Products and Services Data Protection Addendum, which governs data processing by the Azure OpenAI Service except as otherwise provided in the applicable Product Terms. Azure OpenAI was designed with compliance, privacy, and security in mind; however, the customer is responsible for its use and the implementation of this technology.

OpenAI Service: Content filtering
Azure OpenAI Service includes a content management system that works alongside core models to filter content. This system works by running both the input prompt and generated content through an ensemble of classification models aimed at detecting misuse. If the system identifies harmful content, either the API call returns an error (when the prompt was deemed inappropriate) or the finish_reason on the response is content_filter, signifying that some of the generation was filtered.
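
On the client side, the second case can be detected by inspecting each choice in the response. The sketch below is an assumption-laden illustration: it works on a plain dictionary shaped like a completions response, the sample data is invented, and `filtered_choices` is a hypothetical helper. The rejected-prompt case surfaces as an API error instead and is not handled here.

```python
def filtered_choices(response):
    """Return the indices of choices whose generation was cut by the content filter."""
    return [
        index
        for index, choice in enumerate(response.get("choices", []))
        if choice.get("finish_reason") == "content_filter"
    ]


# Invented sample shaped like a completions response (assumption).
sample_response = {
    "choices": [
        {"text": "Here is a summary of your document.", "finish_reason": "stop"},
        {"text": "", "finish_reason": "content_filter"},
    ]
}

flagged = filtered_choices(sample_response)
if flagged:
    print(f"Choices filtered by the content management system: {flagged}")
```

A caller might retry with a rephrased prompt or show a fallback message when the returned list is not empty.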


OpenAI: Use Cases

Using ChatGPT-3 and Power Automate to triage your bugs
I’ve been mulling over different use case experiments I can undertake with the ChatGPT-3 language model, and last night – lying in bed – I had a light-bulb moment. What about giving ChatGPT-3 a first crack at triaging bugs we raise as a normal part of the software development lifecycle? Thanks to awesome integration between Azure DevOps and Power Automate, I was able to set up a Power Automate flow which triggers any time a new bug is raised.


Azure Synapse Analytics

Design a PolyBase data loading strategy for dedicated SQL pool in Azure Synapse Analytics
Traditional SMP data warehouses use an Extract, Transform, and Load (ETL) process for loading data. Azure SQL pool is a massively parallel processing (MPP) architecture that takes advantage of the scalability and flexibility of compute and storage resources. An Extract, Load, and Transform (ELT) process can take advantage of built-in distributed query processing capabilities and eliminate resources needed to transform the data before loading.


Microsoft News

Microsoft thinks AI can beat Google at search — CEO Satya Nadella explains why
Microsoft announced that the next version of the Bing search engine would be powered by OpenAI, the company that makes ChatGPT. There’s also a new version of the Edge web browser with OpenAI chat tech in a window that can help you browse and understand web pages.

Cool Stuff

Galileo AI
Trained on thousands of outstanding designs, Galileo AI turns natural language prompts into high-fidelity designs, editable in Figma.


Have a great week!