Extracting raw data from HDFS
I have tested multiple ways to get data out of HDFS. There are two situations:
→ The data fits into a single node's RAM:

# prog.py: produces a local csv
spark.sql("select * from my_table where true") \
    .toPandas() \
    .to_csv("myLocalFile.csv", encoding="utf8")
# exit explicitly, otherwise the spark shell keeps running
exit(0)

# run the script through the pyspark shell:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]

→ The dataset does not fit into RAM:
# prog.py: produces csv part files on hdfs
spark.sql("select * from my_table where true") \
    .write \
    .format("csv") \
    .save("/output/dir/on/hdfs/")
# exit explicitly, otherwise the spark shell keeps running
exit(0)

# run the script through the pyspark shell:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]
# merge the part files onto the local filesystem
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.csv
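If the output directory has already been copied to the local filesystem (e.g. with hadoop fs -get), the merge step can also be reproduced in plain Python. The getmerge helper below is a hypothetical sketch, not part of the hadoop CLI: it concatenates the part-* files in lexicographic name order, which is what hadoop fs -getmerge does with the files of an HDFS directory.

```python
import os
import tempfile

def getmerge(src_dir, dest_path):
    # Concatenate Spark "part-*" files in name order,
    # mimicking `hadoop fs -getmerge` on a local copy of the directory.
    parts = sorted(f for f in os.listdir(src_dir) if f.startswith("part-"))
    with open(dest_path, "wb") as out:
        for name in parts:
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())

# usage with two fake part files standing in for Spark's csv output
src = tempfile.mkdtemp()
for name, row in [("part-00000", "a,1\n"), ("part-00001", "b,2\n")]:
    with open(os.path.join(src, name), "w") as f:
        f.write(row)

merged = os.path.join(src, "file.csv")
getmerge(src, merged)
with open(merged) as f:
    print(f.read())  # a,1 then b,2, in part-file order
```

Note that Spark's csv writer omits headers by default; if you enable them, every part file repeats the header line and a plain concatenation will duplicate it.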

