PySpark — Implementing Persisting Metastore

A metastore is a central repository used to manage relational objects such as databases, tables, and views. Apache Hive, Impala, Spark, Presto, etc. implement one in order to maintain this metadata.

Representation Image (Credits: Apache Spark)

Spark allows only two types of catalog implementation, “in-memory” and “hive”. The active implementation can be verified with the following configuration parameter.

spark.sql.catalogImplementation

The in-memory implementation persists only for the session; once the session is stopped, the metadata is removed along with it. In order to persist our relational metadata, we have to change the catalog implementation to hive.
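For reference, here is a small illustrative sketch (the app name is arbitrary and not part of the original notebook): the catalog implementation is a static setting that has to be chosen when the SparkSession is built, either explicitly or via enableHiveSupport().

# Illustrative: selecting the catalog implementation at session creation time

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Catalog demo") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

spark.conf.get("spark.sql.catalogImplementation")   # returns 'in-memory'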

But what if we don’t have a Hive Metastore deployed? Can we still persist relational entities? The answer is yes, we can. To achieve this, Spark by default uses an embedded Apache Derby database and creates a metastore_db folder in the current working directory.

Now, how do we enable this? Let’s check it out with an example.

Default Spark catalog implementation:

# Validate CatalogImplementation
spark.conf.get("spark.sql.catalogImplementation")
Default catalog Implementation

Let’s create a new SparkSession with the Hive catalog implementation. To achieve that, we add enableHiveSupport() while building the SparkSession object.

# Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Metastore with Hive Support") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

spark
Spark Session with hive catalog
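We can re-check the configuration parameter to confirm that the new session now uses the Hive catalog.

# Validate CatalogImplementation again
spark.conf.get("spark.sql.catalogImplementation")   # now returns 'hive'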

Once the SparkSession object is created, we can find a metastore_db folder in the default working directory; this folder stores all the required metadata to persist our relational entities.

metastore_db created
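As a quick sanity check (a minimal sketch, assuming the notebook runs from the default working directory), we can confirm the folder from Python:

# Check that the Derby-backed metastore folder exists in the working directory
import os
print("metastore_db" in os.listdir("."))   # True once the metastore is initialised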

By default, the path specified in spark.sql.warehouse.dir is used to store the data of managed tables. For external tables, a location needs to be specified while creating them (see the sketch after the managed-table write below).

Now, let’s quickly create a table to check out our metastore.

# Read data from Parquet

df = spark.read \
    .format("parquet") \
    .load("dataset/sales.parquet")

df.printSchema()
df.show(10)
Example Dataset

Write the dataset as a managed table:

# Write dataframe as table

df.write.saveAsTable("sales_managed")
Metadata
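In contrast, an external table keeps its data at a user-supplied location. Here is an illustrative sketch (the sales_external name and path are assumptions, not part of the original notebook):

# Write dataframe as an external table by specifying an explicit path
df.write \
    .option("path", "dataset/external/sales_external") \
    .saveAsTable("sales_external")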

Now, if we stop this session and create a new SparkSession from the same working directory, we will still be able to access all the table metadata, since it is persisted.
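A minimal sketch of this round trip (same session settings as before; the listTables() call is just for illustration):

# Stop the current session and start a new one from the same working directory
spark.stop()

spark = SparkSession \
    .builder \
    .appName("Metastore with Hive Support") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# The sales_managed table is still registered in the Derby-backed metastore
spark.catalog.listTables()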

%%sparksql

select * from sales_managed limit 10;
Table data

Conclusion: The benefit of a metastore implementation comes into play when we have to read a dataset and face performance hits due to data scanning.

Check out this article to understand more about data scanning: https://urlit.me/blog/pyspark-optimize-data-scanning-exponentially/

Now we can read the data directly by table name, since the metastore stores the table name, metadata/schema, and path of the table.
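For instance, the managed table can be loaded straight from the catalog without knowing its file path (sales_df is just an illustrative variable name):

# Read the managed table directly through the metastore
sales_df = spark.table("sales_managed")
sales_df.show(10)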

Check out the iPython Notebook on GitHub: https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/26_metastore_in_pyspark.ipynb

Check out my personal blog: https://urlit.me/blog/

Check out the PySpark Series on Medium: https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30


About the Author

Subham works as a Senior Data Engineer at a multinational Data Analytics and Artificial Intelligence organization.
Check out his portfolio: Subham Khandelwal