Extracting raw data from HDFS
I have tested multiple ways to get data out of HDFS. There are two situations:
→ The data fits into the RAM of a single local node:
# will produce a local csv
sql("select * from my_table where true") \
    .toPandas() \
    .to_csv("myLocalFile.csv", encoding="utf8")
exit(0)  # otherwise the spark shell keeps running

# run the python script:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]
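The .toPandas() call pulls the whole result set into the driver's memory as a pandas DataFrame, so the final CSV write is plain pandas with no Spark involved. A minimal sketch of that last step, with a toy DataFrame standing in for the collected result (filename and data are illustrative):

```python
import pandas as pd

# toy stand-in for the DataFrame that sql(...).toPandas() would return
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# same call as in the snippet above; index=False skips the pandas row index
df.to_csv("myLocalFile.csv", index=False, encoding="utf8")
```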
→ The dataset does not fit into RAM:
# will produce some csv files on hdfs
sql("select * from my_table where true") \
    .write.format("csv") \
    .save("/output/dir/on/hdfs/")
exit(0)  # otherwise the spark shell keeps running

# run the python script:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]
# merge the files on the local filesystem
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.csv
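getmerge simply concatenates the part files Spark wrote into a single local file. If the part files have already been copied to the local filesystem (e.g. with hadoop fs -get) and the hadoop client is not at hand, the same merge can be sketched in plain Python; the paths and helper name below are illustrative:

```python
import glob
import shutil

def merge_parts(parts_dir: str, output_file: str) -> None:
    """Concatenate Spark part files (sorted, for a stable order) into one CSV."""
    with open(output_file, "wb") as out:
        for part in sorted(glob.glob(f"{parts_dir}/part-*")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# usage (illustrative paths):
# merge_parts("/local/copy/of/output/dir", "/desired/local/output/file.csv")
```

Note that Spark's CSV writer does not emit a header by default, so a plain concatenation is safe; if a header was requested, it would be repeated once per part file.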