step 1 :
Open the terminal and make sure that HDFS and YARN are running, using the following commands:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
Now we can check that all the daemons are up with this command:
$ jps
If everything started correctly, the output should list the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager) along with Jps itself.
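For illustration, on a single-node setup the output typically looks something like this (the process IDs and ordering will differ on your machine):
4866 NameNode
5021 DataNode
5241 SecondaryNameNode
5407 ResourceManager
5561 NodeManager
5902 Jps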
step 2 :
Now we’ll install a library called libhdfs3, an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, offering first-class support to non-JVM languages like Python.
1. Create a virtual environment
$ conda create --name hdfs_env
$ conda activate hdfs_env
To get information about your active environment, use this command:
$ conda info
2. Install libraries
Now install libhdfs3 inside the virtual environment:
$ conda install libhdfs3
hdfs3 is a small wrapper around libhdfs3. It gives the user a simple API for accessing libhdfs3 functionality in a Pythonic way. Use the following command to install hdfs3:
$ conda install hdfs3
3. Connect to the HDFS file system
In my setup, HDFS is running on localhost, with the NameNode listening on port 9000 (the port configured for fs.defaultFS in core-site.xml).
We can check whether the connection is successful either from the Python command line or from a Jupyter Notebook.
Open a notebook and execute the following commands:
from hdfs3 import HDFileSystem
dfs = HDFileSystem(host='localhost', port=9000)
If this runs without errors, you’re good to go!
4. Work with the HDFS API
- list files
dfs.ls("/")  # list the contents of the root directory
- remove files and directories
For files, we don't need to pass the recursive argument, but to remove a directory we need to make sure that recursive is set to True, as in the sketch below.
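For illustration, assuming a file /data/old.csv and a directory /data/tmp already exist on HDFS (both paths are just examples), removal looks like this:
dfs.rm("/data/old.csv")              # remove a single file
dfs.rm("/data/tmp", recursive=True)  # remove a directory and everything inside it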
- make directories
To create a directory, we can use the mkdir function, as in the following example:
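A minimal sketch (the path /user/test/new_dir is just an example):
dfs.mkdir("/user/test/new_dir")  # create a new directory on HDFS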
- check if a file exists
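We can use the exists function for this; a quick sketch with a hypothetical path:
dfs.exists("/user/test/new_dir")  # returns True if the path exists, False otherwise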
For more information about the API, see https://hdfs3.readthedocs.io/en/latest/api.html