step 1 :
Open the terminal and make sure that HDFS and YARN are running, using the following commands:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
Now we can check that all the daemons are up with this command:
$ jps
If everything started correctly, the output should list the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager) along with Jps itself.
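For illustration, on a single-node setup the output typically looks something like this (the process IDs and ordering will differ on your machine):
4866 NameNode
5021 DataNode
5241 SecondaryNameNode
5407 ResourceManager
5561 NodeManager
5902 Jps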
step 2 :
Now we’ll install a library called libhdfs3, an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, offering first-class support to non-JVM languages like Python.
1. Create a virtual environment
$ conda create --name hdfs_env
$ conda activate hdfs_env
To get information about your active environment, use this command:
$ conda info
2. Install libraries
Now install libhdfs3 inside the virtual environment:
$ conda install libhdfs3
hdfs3 is a small wrapper around libhdfs3. It gives the user a simple API for accessing libhdfs3 functionality in a Pythonic way. Use the following command to install hdfs3:
$ conda install hdfs3
3. Connect to the HDFS file system
In my setup, HDFS is running on localhost, with the NameNode listening on port 9000 (the port configured for fs.defaultFS in core-site.xml).
We can check whether the connection is successful either from the Python command line or from a Jupyter Notebook.
Open a notebook and execute the following commands:
from hdfs3 import HDFileSystem
dfs = HDFileSystem(host='localhost', port=9000)
If this runs without errors, you’re good to go!
4. Work with the HDFS API
- list files
dfs.ls("/")  # list the contents of the root directory
- remove files and directories
For files, we don't need to pass the recursive argument, but to remove a directory we need to make sure that recursive is set to True, as in the sketch below.
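For illustration, assuming a file /data/old.csv and a directory /data/tmp already exist on HDFS (both paths are just examples), removal looks like this:
dfs.rm("/data/old.csv")              # remove a single file
dfs.rm("/data/tmp", recursive=True)  # remove a directory and everything inside it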
- make directories
To create a directory, we can use the mkdir function, as in the following example:
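A minimal sketch (the path /user/test/new_dir is just an example):
dfs.mkdir("/user/test/new_dir")  # create a new directory on HDFS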
- check if a file exists
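We can use the exists function for this; a quick sketch with a hypothetical path:
dfs.exists("/user/test/new_dir")  # returns True if the path exists, False otherwise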
For more information about the API, see https://hdfs3.readthedocs.io/en/latest/api.html