PySpark — Read Binary Files like PNG or PDF

Can Spark read a .png or .pdf file? The answer is yes. Spark can load almost any type of file into a DataFrame as a binary file.

Spark has a built-in binaryFile format that loads any file and stores its content as binary. The binary content (BLOB) can later be written back to the appropriate file format as required.
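To make the shape of the result concrete: the binaryFile source returns one row per file, with the columns path, modificationTime, length and content. The helper below is only a sketch and is not part of the original notebook. The name `load_binary` and its parameters are hypothetical, and it assumes an existing SparkSession is passed in as `spark`. It uses the `pathGlobFilter` and `recursiveFileLookup` reader options.

```python
# Columns produced by Spark's binaryFile source (one row per file).
BINARY_FILE_COLUMNS = ("path", "modificationTime", "length", "content")


def load_binary(spark, path, glob=None, recursive=False):
    """Hypothetical helper: load the files under `path` as binary rows.

    `spark` is assumed to be an existing SparkSession.
    """
    reader = spark.read.format("binaryFile")
    if glob is not None:
        # Keep only files whose names match the glob, e.g. "*.png".
        reader = reader.option("pathGlobFilter", glob)
    if recursive:
        # Also pick up files in subdirectories.
        reader = reader.option("recursiveFileLookup", "true")
    return reader.load(path)
```

In a notebook this could be called as load_binary(spark, "dataset/files", glob="*.png"), followed by .select("path", "length").show(truncate=False) to list the files without printing their content.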

Let's read a few binary files as a quick demonstration. These are the files we are going to read:

%%sh
ls -lhtr dataset/files
Check files available

Let's read a single .png file and inspect the resulting DataFrame:

# Read a single .png file
df_spark_png = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/files/spark.png")
df_spark_png.printSchema()
df_spark_png.show()
PNG file as binary

Next, we read all the .png files in the path using a wildcard:

# Read all .png files in the directory
df_spark_png = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/files/*.png")
df_spark_png.printSchema()
df_spark_png.show()
All PNG files in path

Can we read a PDF file? Yes

# We can even read PDF files
df_spark_pdf = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/files/*.pdf")
df_spark_pdf.printSchema()
df_spark_pdf.show()
PDF file as binary
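Because the content column holds the raw bytes of each file, one way to sanity-check what was loaded is to look at the leading bytes. The sketch below is not from the original notebook. `sniff_type` is a hypothetical helper that recognises only the PNG signature and the `%PDF-` header and reports everything else as unknown.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
PDF_HEADER = b"%PDF-"                 # PDF files start with this marker


def sniff_type(content):
    """Hypothetical helper: guess the file type from its leading bytes."""
    data = bytes(content)  # accept bytes or bytearray
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(PDF_HEADER):
        return "pdf"
    return "unknown"
```

On the driver, this could be applied to the rows returned by df_spark_pdf.select("path", "content").collect(), calling sniff_type(row["content"]) for each row. Collecting pulls the file contents to the driver, so this suits small samples only.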

Can we read a TXT file as binary? Yes.

# We can even read Text files as binary files
df_spark_txt = spark \
    .read \
    .format("binaryFile") \
    .load("dataset/example.txt")
df_spark_txt.printSchema()
df_spark_txt.show()
TXT file as Binary
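A text file read this way is still just bytes, so getting the text back means decoding it. A minimal sketch, assuming the file is UTF-8 encoded (the original does not state the encoding) and that the collected value is a bytes-like object. `decode_content` is a hypothetical name.

```python
def decode_content(content, encoding="utf-8"):
    """Hypothetical helper: turn the binary `content` value back into text."""
    # bytes() accepts both bytes and bytearray values.
    return bytes(content).decode(encoding)
```

For the DataFrame above, this could be used as decode_content(df_spark_txt.select("content").first()[0]). Casting the column to a string inside Spark is another option if the text should stay in the DataFrame.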

So, can we now write the files back from the binary content? Yes.

# Regenerate the text file from the binary content
byte_content = df_spark_txt.select("content").collect()[0][0]
# Write the byte content back to a file; the with block closes the file automatically
with open("dataset/new_example.txt", "wb") as f:
    f.write(byte_content)
Binary to TXT file
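The same idea extends from one file to every row of a DataFrame. The sketch below is an assumption-laden illustration, not the original notebook's code. `write_back` is a hypothetical helper that takes (path, content) pairs, keeps only the file name from each path, and writes the bytes into an output directory on the driver. Files sharing a base name would overwrite each other.

```python
import os
from urllib.parse import urlparse


def write_back(rows, out_dir):
    """Hypothetical helper: write (path, content) pairs into `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path, content in rows:
        # Paths may be URIs such as file:/data/files/spark.png;
        # keep only the file name part.
        name = os.path.basename(urlparse(path).path)
        target = os.path.join(out_dir, name)
        with open(target, "wb") as f:
            f.write(bytes(content))  # accept bytes or bytearray
        written.append(target)
    return written
```

With Spark, the pairs could come from df_spark_png.select("path", "content").toLocalIterator(), passing (row["path"], row["content"]) for each row and an output directory of your choice. Everything is written from the driver, so this is suitable for a modest number of small files.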

As demonstrated, Spark can read any file as binary for storage. Later, the binary content can be written back to its original file format as needed.

Check out the IPython notebook on GitHub: https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/13_binary_files.ipynb

Check out the PySpark series on Medium: https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

About the Author

Subham works as a Senior Data Engineer at a multinational data analytics and artificial intelligence organization.
Check out his portfolio: Subham Khandelwal