pyspark.SparkContext.wholeTextFiles
SparkContext.wholeTextFiles(path: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[Tuple[str, str]]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content. The text files must be encoded as UTF-8.

New in version 1.0.0.

For example, if you have the following files:

    hdfs://a-hdfs-path/part-00000
    hdfs://a-hdfs-path/part-00001
    ...
    hdfs://a-hdfs-path/part-nnnnn

Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:

    (a-hdfs-path/part-00000, its content)
    (a-hdfs-path/part-00001, its content)
    ...
    (a-hdfs-path/part-nnnnn, its content)

Parameters
path : str
    directory of the input data files; the path can be a comma-separated list of paths to be used as inputs
minPartitions : int, optional
    suggested minimum number of partitions for the resulting RDD
use_unicode : bool, default True
    If use_unicode is False, the strings will be kept as str (encoded as utf-8), which is faster and smaller than unicode.

    New in version 1.2.0.
 
Returns

RDD
    RDD representing path-content pairs from the file(s).
 
Notes

Small files are preferred, as each file will be loaded fully in memory.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary text file
...     with open(os.path.join(d, "1.txt"), "w") as f:
...         _ = f.write("123")
...
...     # Write another temporary text file
...     with open(os.path.join(d, "2.txt"), "w") as f:
...         _ = f.write("xyz")
...
...     collected = sorted(sc.wholeTextFiles(d).collect())
>>> collected
[('.../1.txt', '123'), ('.../2.txt', 'xyz')]
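As a further, unofficial sketch (not part of the doctest above), the following illustrates the comma-separated form of path described under Parameters, together with a simple transformation of the resulting (path, content) pairs. It assumes an active SparkContext named sc, as in the example above; the directory names, file names, and contents are purely illustrative.

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
...     # Two small files in two separate input directories
...     with open(os.path.join(d1, "a.txt"), "w") as f:
...         _ = f.write("spark")
...     with open(os.path.join(d2, "b.txt"), "w") as f:
...         _ = f.write("rdd")
...
...     # Comma-separated paths are treated as a list of inputs;
...     # minPartitions is only a suggested lower bound on partitions.
...     pairs = sc.wholeTextFiles(",".join([d1, d2]), minPartitions=2)
...
...     # Map each (path, content) pair to (file name, content length)
...     sizes = sorted(pairs.map(lambda kv: (os.path.basename(kv[0]), len(kv[1]))).collect())
>>> sizes
[('a.txt', 5), ('b.txt', 3)]

Because every file becomes a single record, this API suits collections of small per-file documents; for line-oriented input, SparkContext.textFile is usually the better fit.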