Microsoft keeps extending .NET Core into new fields, making .NET truly cross-platform as part of its open source commitment. .NET for Apache Spark gives you APIs for using Apache Spark from C# and F#. With these .NET APIs you can access all aspects of Apache Spark, including Spark SQL for working with structured data, and Spark Streaming. Let's take a closer look.
What is .NET for Apache Spark?
Spark is a popular open source distributed processing engine for analyzing large data sets, often at terabyte scale. It can be used for batch processing, real-time streaming, machine learning, and ad hoc queries.
Processing tasks are distributed on a cluster of nodes, and data is cached in memory to reduce computing time. So far, Spark has been accessible through Scala, Java, Python and R, but not through .NET.
.NET for Apache Spark is designed to make Apache® Spark™ accessible to .NET developers across all Spark APIs.
.NET for Apache Spark provides high-performance APIs for working with Spark from C# and F#. Using these .NET APIs, you can access all the features of Apache Spark, including Spark SQL, DataFrames, streaming, MLlib, and more. .NET for Apache Spark also lets you reuse all the knowledge, skills, code, and libraries that you already have as a .NET developer.
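To make the streaming part concrete, here is a minimal sketch of a streaming word count in C#. It assumes the Microsoft.Spark NuGet package and uses Spark's Structured Streaming API with a socket source. The host, port, and application name are illustrative assumptions, not values from this article.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("streaming_word_count_sketch") // hypothetical name
            .GetOrCreate();

        // Assumption: a text source is listening on localhost:9999,
        // for example one started with `nc -lk 9999`.
        DataFrame lines = spark
            .ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", "9999")
            .Load();

        // Split each incoming line into words, one word per row.
        DataFrame words = lines.Select(
            Explode(Split(lines["value"], " ")).Alias("word"));

        // Keep a running count per word.
        DataFrame wordCounts = words.GroupBy("word").Count();

        // Print the full, updated counts to the console after every batch.
        StreamingQuery query = wordCounts
            .WriteStream()
            .OutputMode("complete")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}
```

The transformation steps are the same DataFrame calls used for batch data. Only the reader (`ReadStream`) and the writer (`WriteStream`) differ.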
The C#/F# language binding to Spark is built on a new Spark interop layer, which offers easier extensibility. This interop layer was written with the best practices for language extensions in mind and is optimized for interop and performance. In the long run, this extensibility can be used to add support for other languages in Spark.
.NET for Apache Spark is compliant with .NET Standard 2.0 and can be used on Linux, macOS, and Windows. Official website: https://dotnet.microsoft.com/apps/data/spark
Quick start .NET for Apache Spark
In this section, we will show how to use .NET Core to run .NET for Apache Spark applications on Windows. Before you start, you need to install a few prerequisites: the .NET Core 2.1 SDK, Visual Studio 2019, Java 1.8, and Apache Spark 2.4.x. For the detailed steps, refer to the getting-started instructions for .NET for Apache Spark. Once everything is installed, you can write Spark applications in .NET in three simple steps. Our first .NET Spark application is a basic Spark pipeline that counts the number of occurrences of each word in a text file.
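    // Required using directives (SparkSession, DataFrame, and the Split/Explode functions)
    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    // 1. Create a Spark session
    var spark = SparkSession
        .Builder()
        .AppName("word_count_sample")
        .GetOrCreate();

    // 2. Create a DataFrame with one row per line of the input file
    DataFrame dataFrame = spark.Read().Text("input.txt");

    // 3. Manipulate and view data: split lines into words, then count each word
    var words = dataFrame.Select(Split(dataFrame["value"], " ").Alias("words"));

    words.Select(Explode(words["words"])
        .Alias("word"))
        .GroupBy("word")
        .Count()
        .Show();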
Features of .NET for Apache Spark
Use C# or F# for Apache Spark development
.NET for Apache Spark provides APIs for working with Apache Spark from C# and F#. Using these .NET APIs, you can access all the features of Apache Spark, including Spark SQL for processing structured data, and Spark Streaming.
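As an illustration of the Spark SQL access described above, the following sketch registers a DataFrame as a temporary view and queries it with SQL from C#. It assumes the Microsoft.Spark NuGet package. The file `people.json` and its `name` and `age` fields are hypothetical and only serve the example.

```csharp
using Microsoft.Spark.Sql;

class Program
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("spark_sql_sketch") // hypothetical name
            .GetOrCreate();

        // Hypothetical input: one JSON object per line,
        // each with a "name" and an "age" field.
        DataFrame people = spark.Read().Json("people.json");

        // Expose the DataFrame to Spark SQL under the name "people".
        people.CreateOrReplaceTempView("people");

        // Query the view with plain SQL.
        DataFrame adults = spark.Sql(
            "SELECT name, age FROM people WHERE age >= 18 ORDER BY age");
        adults.Show();

        // The same filter expressed through the DataFrame API.
        people.Filter(people["age"] >= 18)
            .Select("name", "age")
            .Show();

        spark.Stop();
    }
}
```

Both forms go through the same Spark engine, so choosing between SQL text and DataFrame methods is mostly a matter of taste.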
The first version of .NET for Apache Spark performed very well on the popular TPC-H benchmark, which consists of a set of business-oriented queries. The following figure compares the performance of .NET Core, Python, and Scala on the TPC-H query set.
The chart above compares the per-query performance of .NET for Apache Spark with Python and Scala. .NET for Apache Spark performs well against both. In addition, where UDF performance is critical, such as query 1, in which 3 billion rows of non-string data are passed between the JVM and the CLR, .NET for Apache Spark is 2x faster than Python. It is also worth noting that this is the first preview version of .NET for Apache Spark, and our goal is to invest further in improvements and benchmark performance (for example, Arrow optimization). You can follow the instructions in our GitHub repository to run the benchmark yourself.
Image Source: Microsoft
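For readers unfamiliar with UDFs (user-defined functions), the benchmark case above concerns code like the following: an ordinary C# function that Spark calls for each row, which means row data has to cross between the JVM and the CLR. This is a minimal sketch that assumes the Microsoft.Spark NuGet package and reuses the `input.txt` file from the quick start. The column alias and application name are illustrative assumptions.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("udf_sketch") // hypothetical name
            .GetOrCreate();

        // One row per line of text, in a column named "value".
        DataFrame lines = spark.Read().Text("input.txt");

        // Wrap a C# lambda as a Spark UDF: string in, int out.
        // Spark runs it in the .NET worker process for every row.
        Func<Column, Column> lineLength =
            Udf<string, int>(line => line?.Length ?? 0);

        // Apply the UDF like any built-in column function.
        lines.Select(
                lines["value"],
                lineLength(lines["value"]).Alias("length"))
            .Show();

        spark.Stop();
    }
}
```

Built-in functions such as `Split` and `Explode` run entirely inside the JVM, so a UDF is worth its cross-process cost mainly when the logic cannot be expressed with built-ins.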
Leverage the .NET ecosystem
.NET for Apache Spark allows you to reuse all the knowledge, skills, code, and libraries that you already have as a .NET developer.
Your data processing code can also take advantage of the large library ecosystem available to .NET developers, such as Newtonsoft.Json, ML.NET, MathNet.Numerics, NodaTime, and more.
.NET for Apache Spark can be used on Linux, macOS, and Windows, just like the rest of .NET. It is available by default in Azure HDInsight and can be installed in Azure Databricks, Azure Kubernetes Service, AWS Databricks, AWS EMR, and more.
Open source and free
.NET for Apache Spark is part of the open source .NET platform, which has a strong community of more than 60,000 contributors from more than 3,700 companies. .NET is free, and that includes .NET for Apache Spark. There are no fees or licensing costs, including for commercial use.
What's next for .NET for Apache Spark
Today is the first step of our journey. The following are some of the features on our near-term roadmap.
Simplified getting-started experience, documentation, and samples
Native integration with developer tools such as Visual Studio, Visual Studio Code, and Jupyter notebooks
.NET support for user-defined aggregate functions
.NET idiomatic APIs for C# and F# (for example, using LINQ to write queries)
Out-of-the-box support for Azure Databricks, Kubernetes, etc.
Make .NET for Apache Spark part of Spark Core.
.NET for Apache Spark is a milestone in Microsoft's open source commitment to making .NET a great technology stack for building big data applications.
For more information, you can visit the GitHub repository of .NET for Apache Spark: https://github.com/dotnet/spark
Parts of this article are based on: https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark