RZT aiOS blocks, whether pre-built or custom, can be executed individually as a single block or as part of a pipeline. When running a block, a user is also concerned with the execution environment and the mechanism by which data moves from one block to another. This document covers the different execution environments in detail; to learn about the data transfer methods, see Data transport between blocks. The tutorial progresses from running a block in the simplest environment, a single-threaded process, to complex distributed environments such as Spark and Horovod.
A block can be run in different types of execution environments (see the sketch after this list):
- ThreadExecutor
- SubprocessExecutor (a specialization of ThreadExecutor)
- ProcessExecutor (a specialization of ThreadExecutor)
- ContainerExecutor
- SparkExecutor (a specialization of ContainerExecutor)
- HorovodExecutor (a specialization of ContainerExecutor)
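This tutorial only demonstrates ContainerExecutor directly. As a minimal sketch, assuming the other executors are exposed under razor.flow the same way rf.ContainerExecutor is (an assumption; only ContainerExecutor appears later in this document), an executor is attached to a block through its executor attribute:

import razor.flow as rf

# Hypothetical sketch: `some_block` is a placeholder block instance, and
# rf.SubprocessExecutor is assumed to be exposed alongside
# rf.ContainerExecutor, which is the only executor this tutorial confirms.
some_block.executor = rf.SubprocessExecutor()
some_block.execute()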
By default, when no executor is specified, a block is executed as a subprocess forked from the Jupyter kernel process. This is ideal for quickly trying out small prototypical code.
The CsvReader block in the code below reads a file from the project space in multiple chunks using pandas and outputs the shape of each chunk.
import razor.flow as rf
from razor.api import project_space_path
import pandas as pd


@rf.block
class CsvReader:
    filename: str
    output: rf.SeriesOutput[tuple]

    def run(self):
        # Resolve the file path inside the project space
        file_path = project_space_path(self.filename)
        # Read the CSV lazily in chunks of 100 rows each
        chunks = pd.read_csv(file_path, chunksize=100, nrows=None, delimiter=None)
        for df in chunks:
            # Emit the (rows, columns) shape of each chunk
            self.output.put(df.shape)

csv_reader = CsvReader("Read csv file", filename="titanic/train.csv")
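To run the block on its own in the default environment, call execute() on the instance, just as the ContainerExecutor example later in this tutorial does:

# With no executor assigned, this runs CsvReader as a subprocess
# forked from the Jupyter kernel process.
csv_reader.execute()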
The above example works well for small files (less than 100 MB). For larger files, one might want to assign more CPU cores and memory. RZT aiOS provides a ContainerExecutor in which one can assign more CPU and memory:
csv_reader.filename = "mnist/mnist_train.csv"
# Run the block in its own container with 2 CPU cores and more memory
csv_reader.executor = rf.ContainerExecutor(cores=2, memory=1000)
csv_reader.execute()
RZT aiOS also allows one to configure and add distributed environments like Apache Spark and Horovod for larger data-processing tasks. To learn more about how to use a Spark engine for running PySpark code, see the section Building and running a spark block. Support for Horovod is not available in the current release.
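As a rough sketch only, and assuming SparkExecutor accepts the same resource arguments as its parent ContainerExecutor (an assumption this tutorial does not confirm; the section Building and running a spark block covers the supported configuration), attaching a Spark environment might look like:

# Hypothetical sketch: SparkExecutor is listed above as a specialization of
# ContainerExecutor, so the same cores/memory arguments are assumed here.
csv_reader.executor = rf.SparkExecutor(cores=4, memory=4000)
csv_reader.execute()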