Functions
Reading Data
DataFrames
What are DataFrames?
DataFrames are two-dimensional objects (tables) that hold your data so you can transform it in any way you see fit. You can think of DataFrames as a more powerful version of Tables in Excel.
How do you create DataFrames?
There are 2 ways to create a DataFrame:
- Use the read_* functions. These functions allow you to read your data in from a variety of different formats.
- Use the collect function when working with a LazyFrame.
Info
Seeing all the data formats may be overwhelming, especially if you are used to working in Excel. Start with what you know and expand to other formats only if you need to. You may well never need every format provided.
LazyFrames
What are LazyFrames?
LazyFrames are tables whose data you stream into Pypdex, as opposed to reading it in directly as you would with a DataFrame.
What is Streaming Data?
Streaming data means the computer applies your Transformation Steps to chunks of data, one chunk at a time, instead of trying to do everything all at once.
This allows you to define your select, filter, group_by and aggregate steps before any data is read into memory. The data is then processed in chunks.
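To make the idea of chunked processing concrete, here is a plain-Python sketch that uses only the standard library, not Pypdex itself. It filters and sums a made-up CSV a few rows at a time, so only one chunk is held in memory at once. The helper `iter_chunks`, the chunk size and the data are hypothetical, chosen for illustration.

```python
import csv
import io
from itertools import islice

# Made-up data; in practice this would be a file too large to load at once.
raw = io.StringIO("region,units\nNorth,10\nSouth,25\nNorth,5\nSouth,1\nNorth,7\n")


def iter_chunks(rows, size):
    """Yield lists of at most `size` rows until the source is exhausted."""
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


reader = csv.DictReader(raw)
total_north = 0
chunks_seen = 0

for chunk in iter_chunks(reader, size=2):
    chunks_seen += 1
    # Filter, then aggregate, using only the rows in the current chunk.
    total_north += sum(int(r["units"]) for r in chunk if r["region"] == "North")

print(total_north, chunks_seen)
```

A real streaming engine does far more than this (it plans which columns and rows to read at all), but the memory pattern is the same: the running total survives between chunks while the rows themselves are discarded.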
Tip
LazyFrames are the only way to work with datasets that are larger than the RAM on your computer.
Note
You can turn LazyFrames into DataFrames by adding the collect function. collect tells the computer to read the data based on whatever Transformation Steps you have created.
How do you create LazyFrames?
LazyFrames can be created using the scan_* functions. The following file formats can be scanned:
- CSV
- Delta Tables
- Iceberg
- IPC
- NDJSON
- Parquet
Exporting Data
DataFrames
How do you export data?
Data can be exported using the write_* functions. These functions support the same file formats as mentioned above.
LazyFrames
Converting file types
If all you are interested in is saving the output of a LazyFrame, you can use the sink_* functions to do just that. These functions allow you to stream the reading, transformation and exporting of your data into the following formats: