pyspark.pandas.DataFrame.interpolate
- 
DataFrame.interpolate(method: str = 'linear', limit: Optional[int] = None, limit_direction: Optional[str] = None, limit_area: Optional[str] = None) → pyspark.pandas.frame.DataFrame
- Fill NaN values using an interpolation method.
- Note
- The current implementation of interpolate uses Spark’s Window without specifying a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method with very large datasets.
- New in version 3.4.0.
- Parameters
- method: str, default ‘linear’
- Interpolation technique to use. One of:
- ‘linear’: Ignore the index and treat the values as equally spaced.
 
- limit: int, optional
- Maximum number of consecutive NaNs to fill. Must be greater than 0. 
- limit_direction: str, default None
- Consecutive NaNs will be filled in this direction. One of {‘forward’, ‘backward’, ‘both’}.
- limit_area: str, default None
- If limit is specified, consecutive NaNs will be filled with this restriction. One of:
- None: No fill restriction.
- ‘inside’: Only fill NaNs surrounded by valid values (interpolate). 
- ‘outside’: Only fill NaNs outside valid values (extrapolate). 
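The interaction of limit and limit_area can be sketched in plain Python. This is an illustration only, not the Spark implementation: `forward_fill_mask` is a hypothetical helper, it covers the ‘forward’ direction only, and it assumes the semantics follow the parameter descriptions above in the same way as the pandas method of the same name.

```python
import math
from typing import List, Optional


def forward_fill_mask(
    values: List[float],
    limit: Optional[int] = None,
    limit_area: Optional[str] = None,
) -> List[bool]:
    """Return True at each NaN position that a forward fill may fill.

    Hypothetical helper for illustration; not part of pyspark.pandas.
    """
    if limit is not None and limit <= 0:
        raise ValueError("limit must be greater than 0")
    if limit_area not in (None, "inside", "outside"):
        raise ValueError("limit_area must be None, 'inside' or 'outside'")

    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    mask = [False] * len(values)
    if not valid:
        return mask  # nothing to interpolate from
    first, last = valid[0], valid[-1]

    run = 0  # consecutive NaNs seen since the most recent valid value
    # Start at the first valid value: a forward fill never reaches leading NaNs.
    for i in range(first, len(values)):
        if not math.isnan(values[i]):
            run = 0
            continue
        run += 1
        within_limit = limit is None or run <= limit
        inside = i < last  # this NaN has a valid value on both sides
        area_ok = (
            limit_area is None
            or (limit_area == "inside" and inside)
            or (limit_area == "outside" and not inside)
        )
        mask[i] = within_limit and area_ok
    return mask


nan = float("nan")
column = [nan, 1.0, nan, nan, nan, 5.0, nan, nan]
print(forward_fill_mask(column, limit=1))
print(forward_fill_mask(column, limit_area="inside"))
```

With limit=1 only the first NaN of each run is eligible; with limit_area=‘inside’ the trailing run is excluded because no valid value follows it.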
 
 
- Returns
- Series or DataFrame or None
- Returns the same object type as the caller, interpolated at some or all NA values. 
 
- See also
- fillna
- Fill missing values using different methods. 
- Examples
- Filling in NA via linear interpolation.

    >>> s = ps.Series([0, 1, np.nan, 3])
    >>> s
    0    0.0
    1    1.0
    2    NaN
    3    3.0
    dtype: float64
    >>> s.interpolate()
    0    0.0
    1    1.0
    2    2.0
    3    3.0
    dtype: float64

- Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
- Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NA, because there is no entry before it to use for interpolation.

    >>> df = ps.DataFrame([(0.0, np.nan, -1.0, 1.0),
    ...                    (np.nan, 2.0, np.nan, np.nan),
    ...                    (2.0, 3.0, np.nan, 9.0),
    ...                    (np.nan, 4.0, -4.0, 16.0)],
    ...                   columns=list('abcd'))
    >>> df
         a    b    c     d
    0  0.0  NaN -1.0   1.0
    1  NaN  2.0  NaN   NaN
    2  2.0  3.0  NaN   9.0
    3  NaN  4.0 -4.0  16.0
    >>> df.interpolate(method='linear')
         a    b    c     d
    0  0.0  NaN -1.0   1.0
    1  1.0  2.0 -2.0   5.0
    2  2.0  3.0 -3.0   9.0
    3  2.0  4.0 -4.0  16.0
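The arithmetic behind the DataFrame example can be reproduced for one column at a time in plain Python. This is a minimal sketch of ‘linear’ interpolation with the default forward fill, not the Window-based Spark implementation; `interpolate_linear_forward` is a hypothetical name introduced here for illustration.

```python
import math
from typing import List


def interpolate_linear_forward(values: List[float]) -> List[float]:
    """Fill NaNs by position, treating entries as equally spaced.

    Hypothetical helper for illustration; not part of pyspark.pandas.
    """
    result = list(values)
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not valid:
        return result  # all NaN: nothing to interpolate from

    # Interior gaps: straight line between the two surrounding valid values.
    for left, right in zip(valid, valid[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)

    # Trailing NaNs have no later value, so the last valid value is repeated.
    for i in range(valid[-1] + 1, len(values)):
        result[i] = values[valid[-1]]

    # Leading NaNs are left as NaN: there is no earlier value to start from.
    return result


nan = float("nan")
print(interpolate_linear_forward([0.0, nan, 2.0, nan]))    # column 'a': [0.0, 1.0, 2.0, 2.0]
print(interpolate_linear_forward([-1.0, nan, nan, -4.0]))  # column 'c': [-1.0, -2.0, -3.0, -4.0]
```

Column ‘c’ shows the equal-spacing rule: the gap from -1.0 to -4.0 spans three positions, so each step is -1.0.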