September 25, 2024 ⏱️ 15 min
By Catalin M. (RnD – Cloud Group)
In the ever-changing digital landscape, as data sets expand exponentially and their analysis grows increasingly complex, organizations are constantly searching for ways to extract deeper insights to enhance decision-making.
To address this need, innovative solutions like Artificial Intelligence (AI) and Cloud Computing have emerged as powerful tools for modern data analytics.
In this article, we will take a look at Microsoft Fabric to understand key concepts like workspaces, notebooks, and ML experiments. We will also get an overview of the main components that make up this SaaS platform, such as OneLake, Spark Compute, Event Streams, Data Activator, and Reflexes.
To give you a better understanding of how and where each piece is used, we will structure our explanations around the natural flow of data: as it enters MS Fabric in the Data Ingestion step, as it is prepared and analyzed in the Data Analysis step, and finally as we derive Actionable Decisions based on the results of the analysis.
What is Cloud AI Analytics?
AI Analytics leverages AI techniques and algorithms to automate data analysis, interpret data, extract insights, and provide predictions or recommendations. By utilizing advanced technologies such as machine learning, natural language processing, and data visualization, AI analytics significantly enhances decision-making capabilities. It enables organizations to reduce costs, minimize errors, and improve accuracy, freeing up human resources for more strategic tasks.
Shifting AI analytic workloads to the cloud offers additional benefits like scalability, flexibility, and cost-efficiency, allowing organizations to manage data more effectively. In recent years, cloud providers have invested heavily in cloud computing platforms and services tailored for data analysis and business intelligence. These platforms are specifically designed to process and analyze large datasets, allow the automation of these processes, and deliver precise insights and actionable recommendations.
Impact on business
AI Cloud Analytics combines the computational power of AI with the scalability and flexibility of cloud computing. This combination allows for unprecedented capabilities in data processing, analysis, and interpretation. From predictive analytics to real-time decision-making, AI Cloud Analytics is revolutionizing intelligence across all business sectors.
By harnessing the power of AI and cloud computing, businesses can unlock new opportunities and redefine what is possible in their industries.
Here are some examples of applicability:
Finance
- Fraud Detection: AI can analyze patterns to identify fraudulent transactions in real time.
- Risk Management: Predictive analytics helps in assessing and mitigating risks.
- Investment Strategies: AI models can predict market trends, optimize portfolios, and automate trading.
Healthcare
- Diagnostics: AI can analyze medical images and data for early disease detection.
- Personalized Treatment: AI-driven insights help tailor treatments based on individual patient data.
Retail
- Inventory Management: Predictive analytics optimizes stock levels and reduces waste.
- Supply Chain Optimization: AI enhances logistics and reduces delays.
- Pricing Strategies: Dynamic pricing models adjust prices based on demand, competition, and other factors.
Manufacturing
- Predictive Maintenance: AI predicts equipment failures and schedules maintenance proactively.
- Quality Control: AI-driven image recognition detects defects in products during production.
Transportation and Logistics
- Route Optimization: AI determines the most efficient routes, reducing fuel costs and delivery times.
- Demand Forecasting: AI predicts transportation needs, helping in fleet management.
Agriculture
- Yield Prediction: AI models predict crop yields, helping in better planning and resource allocation.
Insurance
- Claims Processing: AI automates and speeds up the claims process, reducing human error.
- Risk Assessment: AI analyzes data to assess and price risk accurately.
Microsoft Fabric - Workloads and Key Concepts
Microsoft Fabric is an emerging analytical platform that brings together the power of Azure to deliver an integrated and seamless data experience. Designed to address the complexities of modern data management, Microsoft Fabric unifies data engineering, data integration, data science, and business intelligence into a single platform. With its robust scalability, advanced AI capabilities, and tight integration with Azure services, Microsoft Fabric represents the next evolution in data analytics, offering a powerful toolset for organizations looking to stay ahead in a data-driven world.
Even though Microsoft Fabric is a fairly new platform, available to the general public since November 2023, it is already stealing focus from previous Azure solutions that required more integration effort to achieve a full end-to-end data processing platform, such as Synapse Analytics, Data Factory, and Power BI.
Microsoft Fabric is a SaaS offering that can be thought of as an all-in-one analytics platform because it handles everything from data storage and migration to real-time analytics and data science. It has its own portal that consolidates many existing and new Azure services needed for data analysis, with the goal of integrating and simplifying the interactions between them.
This simplicity allows data professionals to focus on results rather than the technology they use. It also means data teams don’t have to spend hours figuring out how the licensing for Synapse, Azure Data Factory, and Power BI will interact with one another.
Microsoft Fabric architecture has several workloads that run on top of OneLake (Microsoft’s all-in-one storage layer):
1. Data Factory: The data integration service
2. Microsoft Synapse Analytics: Microsoft Synapse Analytics tools have been integrated into Microsoft Fabric
- Synapse Data Warehousing: Lake-centric warehousing that scales compute and storage independently
- Synapse Data Engineering: A Spark service for designing, building, and maintaining your data estate to support data analysis
- Synapse Data Science: A service to create and deploy end-to-end data science workflows at scale
- Synapse Real-Time Analytics: Cloud-based analysis of data from apps, websites, and devices
3. Power BI: Microsoft’s flagship business intelligence service
4. Data Activator: A no-code experience for data observability and monitoring
OneLake
The MS Fabric platform is OneLake-centric, which is natural because everything revolves around data, whether it is being ingested, cleaned, transformed, analyzed, or used for inference. Organizations can combine data from several sources into a single source of truth.
The reason for having OneLake as the single entry point for all storage is to prevent the accumulation of data silos. These tend to appear mainly in large organizations that are divided into smaller units, where data starts to accumulate in isolated stores. This, in turn, makes knowledge-sharing difficult due to lack of access, increases redundancy, and reduces the opportunity to gain the greater insight that could only be achieved by looking at the data as a whole.
In this sense, OneLake is a logical layer on top of the actual data stores. There are currently three main types of storage, all using Delta Parquet under the hood for structured data:
- Lakehouse: Can store unstructured/semi-structured data as files (text, image, video, audio, JSON, CSV, XML) as well as structured data (tables)
- Data warehouse: Here the experience is similar to a SQL database and is used to store structured data in tables with schemas, views, stored procedures, etc.
- KQL database: Can be created inside an Eventhouse (a logical grouping) and is a type of storage optimized for large volumes of time-series and streaming data, queried with the Kusto Query Language
Workspaces
Fabric also comes with workspaces, which serve as a logical separation of resources based on your workflows and use cases. Workspaces can also be used for collaboration and sharing with other parts of your organization by distributing ownership and access policies. Finally, they help with the billing process if your organization has separate cost centers, as each workspace is part of a capacity that is tied to a specific region and is billed separately.
Spark Compute
Spark Compute is a key component of Microsoft Fabric, enabling data engineering and data science scenarios on a fully managed Spark compute platform that delivers unparalleled speed and efficiency.
In Microsoft Fabric, there are two different ways to run Spark code: through a Spark job definition and using notebooks. Each method has its advantages and limitations, so the choice depends on your specific needs and priorities. Here are some guidelines in a nutshell:
- Notebooks: Choose for iterative exploration, prototyping, and quick analysis.
- Spark job definitions: Choose for scheduled tasks, complex pipelines, and production-level data processing.
- You can also combine both approaches: Use notebooks for initial exploration and development, then translate the final code into a Spark job definition for production deployment.
When working with MS Fabric, no matter what problem you are trying to solve, there are three main topics that will be important throughout the entire process:
- Data Ingestion
- Data Analysis
- Actionable Decisions
Data Ingestion
Data ingestion refers to the process of collecting and importing large volumes of data into Azure for processing and storage. This data can be generated by users, but in most cases it is generated by devices. The best tool in Azure for this is Azure IoT Hub, which acts as a central message hub for bi-directional communication between IoT devices and the cloud, enabling real-time data ingestion from millions of devices. It supports various protocols, ensures secure communication, and handles device-to-cloud telemetry, commands, and device management tasks. Once ingested, the data can be routed to other Azure services for further analysis and storage. In our case, we will want to route this data into OneLake.
Keep in mind that with IoT devices, some level of validation can also be performed on-premises by the devices themselves, thus reducing the ingestion of unnecessary data into the cloud.
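To make this concrete, here is a minimal sketch of a device-side script that validates readings locally and only sends plausible telemetry to IoT Hub. The connection string, sensor range, and payload fields are illustrative assumptions, not values from this article.

```python
# Hypothetical device-side telemetry sender with basic edge validation.
# Requires the azure-iot-device package; the connection string is a placeholder.
import json
import random
import time

from azure.iot.device import IoTHubDeviceClient, Message

CONN_STR = "HostName=<your-hub>.azure-devices.net;DeviceId=<id>;SharedAccessKey=<key>"

def read_temperature() -> float:
    # Stand-in for a real sensor read
    return random.uniform(-10.0, 120.0)

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()

try:
    while True:
        value = read_temperature()
        # Edge validation: drop readings that are physically implausible
        # so they never reach the cloud.
        if -40.0 <= value <= 100.0:
            payload = json.dumps({"deviceId": "sensor-01", "temperature": value})
            client.send_message(Message(payload))
        time.sleep(5)
finally:
    client.shutdown()
```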
Now depending on the type of data we need to ingest, the initial landing place in OneLake could be different:
- Lakehouse, which is typically your go-to storage for “Bronze” data, meaning data that is not yet structured, cleaned, validated, or normalized
- KQL Database for data that is already in a structured format
There are different solutions available to ingest data through MS Fabric into OneLake.
Notebooks
Using notebooks, a data scientist can already start to process this data and migrate it from an unstructured to a structured format. Notebooks are built on top of the Spark engine, and the supported languages are PySpark (Python), Spark (Scala), Spark SQL, and SparkR.
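As a minimal illustration, the PySpark sketch below reads raw JSON files landed in a Lakehouse and writes a cleaned Delta table. The paths, column names, and validation rule are assumptions for the example; in a Fabric notebook, the `spark` session is already available.

```python
# Hypothetical notebook cell: promote raw ("Bronze") JSON files to a
# structured Delta table. Paths and columns are illustrative.
from pyspark.sql import functions as F

raw = spark.read.json("Files/bronze/telemetry/*.json")

cleaned = (
    raw.select(
        F.col("deviceId").alias("device_id"),
        F.col("temperature").cast("double"),
        F.to_timestamp("timestamp").alias("event_time"),
    )
    .where(F.col("temperature").isNotNull())
)

# Saving as a table lands it in the Lakehouse attached to the notebook
cleaned.write.format("delta").mode("append").saveAsTable("telemetry_silver")
```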
Dataflows
With Dataflows, there are more than 300 connectors available to help you import or migrate data into different parts of your OneLake, making it available where it is needed most.
Pipelines
Pipelines are mainly an orchestration tool for creating workflows of activities with control-flow logic (looping, branching), but they can also be used to fetch data (with the Copy data activity) and to trigger or invoke many actions both in Fabric and in the wider Azure cloud (Dataflows, other pipelines, notebooks, Azure Functions, etc.).
Event Stream
Event streams are considered part of the Microsoft Fabric Real-Time Intelligence experience, and they allow you to bring real-time events into Fabric, transform them, and then route them to various destinations without writing any code (multiple destinations are also possible for the same event stream).
The event streams feature provides you with connectors to fetch event data from various sources for example:
- Azure Event Hubs
- Azure IoT Hub
- Azure SQL Database Change Data Capture (CDC)
For the destinations, again, several options are available:
- KQL Database
- Lakehouse
- Reflex
To capture the data streaming from devices, we could use an Event Stream as our main ingestion technique. The Event Stream can be configured to ingest data directly from the IoT Hub and send it into OneLake. We might also have to use multiple chained Event Streams to push the data from one layer of our storage to the next while transforming and aggregating it into its final form, ready for further analysis.
Data Analysis
After the collected data is fed into the different storages (layers) of OneLake, it's time to start analyzing it. The first step is to identify the data that will reveal the patterns you are trying to predict. Identifying and prepping this data is mainly the job of a data scientist with a strong math and statistics background, working together with industry professionals who know how the business works.
Azure AI Services
Training your own model is not an easy task, so the first go-to option should be to use already-trained models for your specific needs. Azure AI Services is a collection of ready-to-use artificial intelligence (AI) APIs that allow developers to incorporate advanced AI capabilities into their projects.
These services provide features such as computer vision, natural language processing, speech recognition, machine translation, and more:
- Azure AI Search – Bring AI-powered cloud search to your mobile and web apps
- Azure OpenAI – Perform a wide variety of natural language tasks
- Bot Service – Create bots and connect them across channels
- Content Safety – An AI service that detects unwanted content
- Document Intelligence – Turn documents into intelligent data-driven solutions
- Face – Detect and identify people and emotions in images
- Immersive Reader – Help users read and comprehend text
- Language – Build apps with industry-leading natural language understanding capabilities
- Speech – Speech to text, text to speech, translation, and speaker recognition
- Translator – Use AI-powered translation technology to translate more than 100 in-use, at-risk, and endangered languages and dialects
- Video Indexer – Extract actionable insights from your videos
- Vision – Analyze content in images and videos
The Azure AI services can be invoked directly over HTTP using the REST APIs, from Python code using the SynapseML library, or from C# code using the Azure.AI namespace.
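For instance, a hedged sketch of calling the Azure AI Language service from a notebook with SynapseML might look like the following. The key and region are placeholders, and the exact class and namespace can differ between SynapseML versions, so treat this as an outline rather than a definitive API reference.

```python
# Hedged sketch: sentiment analysis with Azure AI Language via SynapseML.
# Runs in a Fabric notebook where `spark` is predefined; key/region are placeholders.
from synapse.ml.services.language import AnalyzeText

df = spark.createDataFrame(
    [("The new dashboard is great.",), ("Deliveries keep arriving late.",)],
    ["text"],
)

analyze = (
    AnalyzeText()
    .setKind("SentimentAnalysis")
    .setSubscriptionKey("<language-service-key>")  # placeholder
    .setLocation("westeurope")                     # placeholder region
    .setTextCol("text")
    .setOutputCol("analysis")
    .setErrorCol("error")
)

analyze.transform(df).select("text", "analysis").show(truncate=False)
```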
ML Model Training
In some cases, using the generic available models is not enough, so we must take the harder path of training our own model. This process doesn't need to start from scratch: we can take a pretrained model and improve on it. Training an ML model requires clean, valid, and labeled historical data containing the relevant features that can lead to the predictions we are trying to make.
Common operations to enrich this data include grouping it over periods of time, thus adding features such as rolling aggregations, statistical measures, and time-based features to capture temporal patterns.
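As an illustration, the sketch below derives rolling aggregations and an hour-of-day feature from the hypothetical telemetry table used earlier; the window length and column names are assumptions.

```python
# Hypothetical feature engineering: rolling aggregations and time-based
# features over the cleaned telemetry table.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.table("telemetry_silver")  # hypothetical table

# 1-hour rolling window per device, ordered by event time (in seconds)
w = (
    Window.partitionBy("device_id")
    .orderBy(F.col("event_time").cast("long"))
    .rangeBetween(-3600, 0)
)

features = (
    df.withColumn("temp_avg_1h", F.avg("temperature").over(w))
      .withColumn("temp_max_1h", F.max("temperature").over(w))
      .withColumn("hour_of_day", F.hour("event_time"))
)
```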
After the data prepping is done, we can start the actual training of the ML model. Notebooks are the go-to tool here because they leverage the power of the Apache Spark cluster (that runs seamlessly under the hood of notebooks) together with a rich suite of Python libraries like PySpark, SynapseML, and MLflow. This allows us to manipulate data from various sources, use industry-standard statistical algorithms to train ML models, create experiments, and log model versions, parameters, and performance.
Experiments
Data scientists can use experiments to navigate various model training runs and explore underlying parameters and metrics. A machine learning experiment is the primary unit of organization and control for all related machine learning work and contains a collection of runs for simplified tracking and comparison. By comparing runs within an experiment, data scientists can identify which subset of parameters yields the desired model performance. In MS Fabric, experiments can be managed and visualized through the UI, but they can also be created and used in notebooks with MLflow.
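In a notebook, this boils down to a few MLflow calls. A minimal sketch, assuming a toy scikit-learn model and hypothetical experiment and parameter names:

```python
# Minimal sketch: tracking a training run in an experiment via MLflow.
# Experiment name, parameters, and the toy model are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("machine-failure-experiment")  # hypothetical name

# Toy data standing in for the prepared feature table
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
```

Each run logged this way shows up under the experiment in the Fabric UI, where runs can be compared side by side.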
Predictive Insights
Once our model has been trained and we are happy with its performance and accuracy, we can start using it in production.
In MS Fabric, a model is a separate, standalone entity that lives in the workspace. The simplest way to use it is in a notebook, where we can connect to any data source in OneLake, instantiate the model, and apply it to our data using Python code.
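A minimal sketch of that flow, where the model name, version, and table/column names are illustrative placeholders:

```python
# Hypothetical scoring of Lakehouse data with a registered model version.
import mlflow
import mlflow.pyfunc

df = spark.read.table("telemetry_features").toPandas()  # hypothetical table

# Load a specific version of the model registered in the workspace
model = mlflow.pyfunc.load_model("models:/machine-failure-model/1")

df["prediction"] = model.predict(df[["temp_avg_1h", "temp_max_1h", "hour_of_day"]])
```

For larger tables, you would typically keep the scoring inside Spark rather than converting to pandas, so the data never has to leave the cluster.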
The strength of MS Fabric can be coupled with other Azure services. For example, we could have compute services running in a Kubernetes cluster that connect to OneLake and use ML models to draw insights from that data. As more data is collected and analyzed, the AI models in Microsoft Fabric continue to learn and improve, making the predictions more accurate over time.
Actionable Decisions
Fabric also comes with a feature called Data Activator, a no-code detection system targeted at non-technical analysts who understand and focus on the business and can define the appropriate business rules.
Datasets are continuously monitored for specific conditions, and when these rules are met, triggers fire, leading to one or more actions being taken. These actions could be sending a message, an email, or even invoking a custom action.
Data Activator is also built for streaming data, so it is only fitting that it can be connected to an Event Stream as a destination. In Fabric, the Data Activator destination is called a Reflex. This means, for example, that a Reflex could monitor a temperature sensor, and if the value exceeds a threshold, an email could be sent to alert a responsible person, who could then take physical action such as slowing down or even shutting down the machine to prevent critical damage. Using Power Automate opens up further possibilities for taking actions, like making an HTTP call to trigger an Azure Function or sending a message to an Azure Service Bus queue.
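To sketch the custom-action end of that chain: a small HTTP-triggered Azure Function (Python programming model v2) could receive the alert and forward it to a Service Bus queue. The route, connection string setting, queue name, and payload shape are assumptions for illustration.

```python
# Hypothetical HTTP-triggered Azure Function (Python v2 model) that forwards
# a Reflex/Power Automate alert to a Service Bus queue.
import os

import azure.functions as func
from azure.servicebus import ServiceBusClient, ServiceBusMessage

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="machine-alert", methods=["POST"])
def machine_alert(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_body().decode("utf-8")

    # Connection string and queue name come from app settings (placeholders)
    with ServiceBusClient.from_connection_string(os.environ["SB_CONN_STR"]) as sb:
        with sb.get_queue_sender("machine-alerts") as sender:
            sender.send_messages(ServiceBusMessage(payload))

    return func.HttpResponse("Alert forwarded", status_code=202)
```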
Fabric Solutions
The use cases and solutions adopted can vary greatly depending on the industry and specific company needs. To help you kickstart your experience, MS Fabric comes with something called Industry Solutions. These are boilerplate solutions for specific verticals that act as a guide for optimally allocating and configuring the right resources in your workspace. They can be especially helpful when you are new to Fabric and don't yet know what the right tool for the job is.
Currently, there are three Industry Solutions available, but we expect more to come in the future:
1. Healthcare data solutions
- FHIR data ingestion
- Healthcare data foundations
- Unstructured clinical notes enrichment
- OMOP analytics
- Dynamics 365 Customer Insights – Data preparation
- DICOM data ingestion
2. Retail data solutions
- Retail industry data model
- Sitecore OrderCloud connector
- Frequently bought together model
- Copilot template for personalized shopping
3. Sustainability solutions
- ESG data estate
- Microsoft Azure emissions insights
- Environmental metrics and analytics
- Social and governance metrics and reports
Costs
Before you dive in, do not forget to check the associated costs because, as you would expect for any cloud product, the closer it gets to SaaS, the bigger the price tag. To give you an idea: currently, just to get started with a minimal “pay as you go” F2 capacity, the price is around 345 USD/month, dropping to about 215 USD if you commit to a 1-year reservation, while for production environments you can expect a minimum of several thousand USD.
Conclusions
As we move deeper into the era of AI and cloud computing, AI Cloud Analytics is set to bring significant changes for businesses and society. Organizations are at a crossroads where they need to find new ways to adapt to today's digital challenges. The key is to identify the situations where intelligent data analysis can enhance decision-making processes and potentially give you a glimpse into the future.
Whether you choose to use MS Fabric or not depends on your company and on the knowledge of the people involved in maintaining the system.
Many of the features available in MS Fabric can also be implemented directly in Azure, sometimes offering more flexibility but also increased complexity. For a team that is more focused on business processes and less experienced with code and infrastructure, MS Fabric looks like a good option. Ultimately, the purpose of the MS Fabric SaaS platform is to abstract away the complexity of building such a system and empower the people who really know their business to achieve much more independently.
One last thing to keep in mind is that many of these features are still in Public Preview, meaning the product still has room to evolve and mature. It would be wise for now to keep an eye on its evolution and adoption as these will be the main driving factors for Microsoft to keep investing in this direction.