TWIL: September 10, 2023
This week I was mostly focused on Microsoft Fabric, but I also read interesting articles on Computer Vision, Azure AI Document Intelligence, Embeddings and Vector Search. I’m also recommending a few GitHub repos around AI topics, two papers on Large Language Models, and more. Happy learning!
Podcasts
Lex Fridman Podcast
Episode 394: Neri Oxman: Biology, Art, and Science of Design & Engineering with Nature
Neri Oxman is a designer, engineer, scientist, and artist working on computational design, synthetic biology and digital fabrication, previously at MIT, and now at OXMAN.
Microsoft Fabric
Direct Lake
Direct Lake mode is a groundbreaking new dataset capability for analyzing very large data volumes in Power BI. It loads parquet-formatted files directly from a data lake, without querying a Lakehouse endpoint and without importing or duplicating data into a Power BI dataset: a fast path from the lake straight into the Power BI engine, ready for analysis. The linked article includes a diagram comparing classic import and DirectQuery modes with the new Direct Lake mode.
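As a minimal sketch of how data ends up where Direct Lake can see it, the hypothetical PySpark snippet below writes a Delta table into a Fabric lakehouse; the file path and table name are illustrative assumptions, not steps from the article.

```python
# Hypothetical PySpark sketch: land data as a Delta table in a Fabric
# lakehouse. Direct Lake can then read the table's parquet files straight
# from OneLake, with no import into a dataset and no DirectQuery round trip.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "Files/raw/sales.csv" is an assumed staged file in the lakehouse.
sales = (spark.read.format("csv")
         .option("header", "true")
         .load("Files/raw/sales.csv"))

# Saving as a managed Delta table is what makes it visible to Direct Lake.
sales.write.format("delta").mode("overwrite").saveAsTable("sales")
```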
Getting from Azure Data Factory to Data Factory in Microsoft Fabric
Data Factory in Microsoft Fabric is the next generation of Azure Data Factory, providing cloud-scale data movement and data transformation services that let you solve the most complex ETL scenarios. It’s designed to be easy to use, powerful, and truly enterprise-grade. This article compares Azure Data Factory with Data Factory in Microsoft Fabric.
Microsoft Fabric decision guide: copy activity, dataflow, or Spark
Use this reference guide and the example scenarios to help you decide whether you need a copy activity, a dataflow, or Spark for your Microsoft Fabric workloads.
Microsoft Fabric decision guide: data warehouse or lakehouse
Use this reference guide and the example scenarios to help you choose between a data warehouse and a lakehouse for your Microsoft Fabric workloads.
Better together: the lakehouse and warehouse
This article explains the data warehousing experience with the SQL Endpoint of the Lakehouse, and scenarios for using the Lakehouse in data warehousing.
Use Semantic Kernel with Lakehouse in Microsoft Fabric
Microsoft Fabric allows enterprises to bind different data sources through OneLake, so data engineers can call a unified API across different business scenarios to complete data analysis and data science. This article describes how data scientists can use Semantic Kernel with Lakehouse in Microsoft Fabric.
Introducing the dbt adapter for Synapse Data Warehouse in Microsoft Fabric
We are excited to announce the preview of the dbt adapter for Synapse Data Warehouse in Microsoft Fabric. This data platform-specific adapter plugin allows you to connect to and transform data in Synapse Data Warehouse in Microsoft Fabric, continuing Microsoft’s focus on integration and partnership with dbt Labs.
Microsoft Fabric August 2023 update
Welcome to the August 2023 update. We have lots of features this month, including the new layout switcher for Power BI, SSD caching in Synapse Data Warehouse, in-line Python support for KQL in Synapse Real-time Analytics, lookup activity for Data Factory Dataflows, and much more. Continue reading for more details on our new features!
Data Warehouse sharing
Data sharing is essential to fostering a data-driven culture within an organization. Sharing a Warehouse provides read access so downstream users in the organization can consume the data and make data-driven decisions without creating copies of it. With this new capability, an Admin or Member within a workspace can share a Warehouse with recipients (Azure AD users or groups) in your organization. You can also grant these permissions using the “Manage permissions” experience.
Computer Vision
Do image retrieval using vectorization (version 4.0 preview)
The Image Retrieval APIs enable the vectorization of images and text queries. They convert images to coordinates in a multi-dimensional vector space. Then, incoming text queries can also be converted to vectors, and images can be matched to the text based on semantic closeness. This allows the user to search a set of images using text, without the need to use image tags or other metadata. Semantic closeness often produces better results in search.
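As a hedged sketch of that flow, the snippet below vectorizes an image and a text query, then compares them with cosine similarity. The endpoint, key, and api-version are placeholders and assumptions based on the 4.0 preview docs, not values from the article.

```python
# Hedged sketch of the 4.0 preview flow: vectorize an image and a text
# query, then rank by cosine similarity. Endpoint, key, and api-version
# are assumptions; check the current docs before relying on them.
import numpy as np
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # assumption
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-key>",  # assumption
           "Content-Type": "application/json"}
PARAMS = {"api-version": "2023-02-01-preview"}  # preview version at the time

def vectorize_image(image_url: str) -> np.ndarray:
    r = requests.post(f"{ENDPOINT}/computervision/retrieval:vectorizeImage",
                      params=PARAMS, headers=HEADERS, json={"url": image_url})
    return np.array(r.json()["vector"])

def vectorize_text(query: str) -> np.ndarray:
    r = requests.post(f"{ENDPOINT}/computervision/retrieval:vectorizeText",
                      params=PARAMS, headers=HEADERS, json={"text": query})
    return np.array(r.json()["vector"])

img = vectorize_image("https://example.com/cat.jpg")
txt = vectorize_text("a cat sleeping on a sofa")
# Cosine similarity: higher means semantically closer.
print(img @ txt / (np.linalg.norm(img) * np.linalg.norm(txt)))
```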
Vision Transformer (ViT)
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
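The core idea is compact enough to sketch in a few lines. This toy NumPy example (sizes are illustrative, not the paper's exact configuration) shows how an image becomes a sequence of patch tokens that a standard transformer can consume:

```python
# Toy NumPy sketch of the ViT front end: split an image into fixed-size
# patches, flatten each patch, and linearly project it into the token
# sequence a standard transformer expects.
import numpy as np

image = np.random.rand(224, 224, 3)   # H x W x C
patch = 16                            # patch size, as in ViT-Base
d_model = 768                         # embedding dimension

# Cut the image into (224/16)^2 = 196 non-overlapping 16x16x3 patches.
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# One learned linear projection turns each flattened patch into a token
# (random weights stand in for trained ones here).
W = np.random.randn(patch * patch * 3, d_model) * 0.02
tokens = patches @ W                  # (196, 768): a sequence of "words"
print(tokens.shape)
```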
Azure AI Document Intelligence
Azure AI Document Intelligence new capabilities including classification are now generally available
Azure AI Document Intelligence, formerly known as Form Recognizer, now has a new set of capabilities generally available! Documents are core to any business process, from Intelligent Document Processing (IDP) solutions like invoice processing to knowledge extraction like tax filing, financial reporting, and audits. Azure AI Document Intelligence has AI components for you to build your document processing workflows.
Embeddings / Vector Search
Audio embeddings
We can use audio embeddings to find similarities between audio files. Here we use Azure Cognitive Search and its new vector store.
Neural Network Embeddings Explained
In this article, I’ll explain what neural network embeddings are, why we want to use them, and how they are learned. We’ll go through these concepts in the context of a real problem I’m working on: representing all the books on Wikipedia as vectors to create a book recommendation system.
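A minimal sketch of the idea, with hypothetical book data: an embedding is just a learned lookup table mapping each discrete item to a dense vector, and similar items end up with nearby vectors.

```python
# Minimal embedding sketch with hypothetical book data. In a real system
# the matrix is learned (e.g., from Wikipedia link structure); random
# weights stand in for trained ones here.
import numpy as np

books = ["Dune", "Neuromancer", "Emma", "Pride and Prejudice"]
dim = 8
rng = np.random.default_rng(0)

embedding = rng.normal(size=(len(books), dim))  # one row per book

def most_similar(idx: int) -> str:
    v = embedding[idx]
    sims = embedding @ v / (np.linalg.norm(embedding, axis=1)
                            * np.linalg.norm(v))  # cosine similarity
    sims[idx] = -np.inf                           # exclude the book itself
    return books[int(np.argmax(sims))]

print(most_similar(books.index("Emma")))  # nearest neighbor = recommendation
```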
WTF Is a Vector Database: A Beginner’s Guide!
Vector databases have gained significant importance in various fields due to their unique ability to efficiently store, index, and search high-dimensional data points, often referred to as vectors. These databases are designed to handle data where each entry is represented as a vector in a multi-dimensional space. The vectors can represent a wide range of information, such as numerical features, embeddings from text or images, and even complex data like molecular structures.
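At its core, that search boils down to a nearest-neighbor query, sketched below in plain NumPy; real vector databases add approximate indexes (for example HNSW or IVF) so this scales well past brute force.

```python
# What a vector database does at its core: store vectors, then return the
# k entries closest to a query. Brute-force NumPy version for illustration.
import numpy as np

rng = np.random.default_rng(1)
store = rng.normal(size=(10_000, 384))            # 10k stored embeddings
store /= np.linalg.norm(store, axis=1, keepdims=True)  # unit-normalize

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    q = query / np.linalg.norm(query)
    scores = store @ q                            # cosine similarity
    return np.argsort(scores)[::-1][:k]           # ids of nearest vectors

print(top_k(rng.normal(size=384)))
```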
OpenAI Embeddings and Vector Databases Crash Course
Embeddings and vectors are a great way of storing and retrieving information for use with AI services, and OpenAI provides a great embedding API to do this. In this video we explore how to create a vector database by creating embeddings with the OpenAI API and then storing them in SingleStore. The first part of the video covers how to create an embedding using just API requests with Postman. Then we jump into SingleStore and store these in a new database made specifically for vectors like this.
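For a code-side view of the same flow, here is a hedged sketch using the OpenAI Python SDK as it looked at the time (openai<1.0); the SingleStore table schema is my assumption, with the vector serialized as JSON for a MySQL-compatible client.

```python
# Hedged sketch: create an embedding with the pre-1.0 OpenAI SDK, then
# insert it into SingleStore. The table schema is an assumption.
import json
import openai

openai.api_key = "<your-openai-key>"  # assumption: supply your own key

text = "Microsoft Fabric unifies data engineering and BI."
resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
vector = resp["data"][0]["embedding"]  # 1536 floats for ada-002

# Store it via any MySQL-compatible driver (pymysql, singlestoredb, ...):
sql = "INSERT INTO embeddings (content, vector) VALUES (%s, %s)"
params = (text, json.dumps(vector))
# cursor.execute(sql, params)  # uncomment with a live SingleStore connection
```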
Interesting GitHub Repos
Visual search with Azure Computer Vision and Azure Cognitive Search
We will use vector embeddings generated with Azure Computer Vision 4 and Azure Cognitive Search’s new vector support capabilities to build a visual search application.
Video search with Azure Computer Vision 4 (Florence)
A quick prototype of a video analytics solution to analyze content from a video.
Tutorial: ChatGPT + Enterprise data with Semantic Kernel, OpenAI and Azure Cognitive Search.
This progressive tutorial is for building your own AI chat application informed by your enterprise data. In Chapter 1, we start with building a simple ChatGPT-like application using Semantic Kernel (SK). Chapter 2 imports files into a “Memories Store” for reference by the SK orchestrator when chatting. Having the data from these files allows SK to build better prompts so the AI can offer better answers to questions; this is a key part of the Retrieval Augmented Generation (RAG) pattern. Chapter 3 extends the context of the chat application by using Azure Cognitive Search for data indexing and retrieval.
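Stripped of any particular framework, the RAG pattern the tutorial builds up reduces to a short sketch; `embed` and `chat` below are hypothetical helpers standing in for whatever embedding and chat APIs you use.

```python
# The RAG pattern in miniature: embed the question, retrieve the closest
# stored chunks, and pack them into the prompt. `embed` and `chat` are
# hypothetical callables injected by the caller.
import numpy as np

def answer(question: str, chunks: list, chunk_vecs: np.ndarray,
           embed, chat, k: int = 3) -> str:
    q = embed(question)                                # question -> vector
    scores = chunk_vecs @ q / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(chunks[i] for i in np.argsort(scores)[::-1][:k])
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return chat(prompt)                                # grounded completion
```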
LLMs
Lost in the Middle: How Language Models Use Long Contexts
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. We find that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
Current literature, aiming to surpass the “Chain-of-Thought” approach, often resorts to an external modus operandi involving halting, modifying, and then resuming the generation process to boost Large Language Models’ (LLMs) reasoning capacities. This mode escalates the number of query requests, leading to increased costs, memory, and computational overheads. Addressing this, we propose the Algorithm of Thoughts — a novel strategy that propels LLMs through algorithmic reasoning pathways, pioneering a new mode of in-context learning. By employing algorithmic examples, we exploit the innate recurrence dynamics of LLMs, expanding their idea exploration with merely one or a few queries. Our technique outperforms earlier single-query methods and stands on par with a recent multi-query strategy that employs an extensive tree search algorithm. Intriguingly, our results suggest that instructing an LLM using an algorithm can lead to performance surpassing that of the algorithm itself, hinting at LLM’s inherent ability to weave its intuition into optimized searches. We probe into the underpinnings of our method’s efficacy and its nuances in application.
Falcon Models
Falcon 180B is a super-powerful language model with 180 billion parameters, trained on 3.5 trillion tokens. It’s currently at the top of the Hugging Face Leaderboard for pre-trained open Large Language Models and is available for both research and commercial use. The model performs exceptionally well on tasks like reasoning, coding, and knowledge tests, even beating competitors like Meta’s LLaMA 2. Compared with closed-source models, it ranks just behind OpenAI’s GPT-4 and performs on par with Google’s PaLM 2 Large, which powers Bard, despite being half that model’s size.
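As a hedged sketch of trying the Falcon family with Hugging Face transformers: the 180B weights are gated behind a license and need serious multi-GPU hardware, so the smaller instruct variant shown here is the practical way to experiment (swap in "tiiuae/falcon-180B" if you have access).

```python
# Hedged sketch: generate text with a Falcon model via transformers.
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"  # 180B variant is gated and huge
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline("text-generation", model=model_id,
                     tokenizer=tokenizer, device_map="auto",
                     trust_remote_code=True)  # Falcon shipped custom code

print(generator("Microsoft Fabric is", max_new_tokens=40)[0]["generated_text"])
```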
PromptPerfect
Unlock the full potential of prompt engineering, a vital key to outstanding AI-generated content, without the complexity. With PromptPerfect, you can easily develop, debug, and deploy optimized prompts for models like GPT-4, ChatGPT, Midjourney, DALL-E 2, and Stable Diffusion.
Have an awesome week!