Compute Functions

Datum class

class Datum

Variant type for various Arrow C++ data structures.

Public Functions

Datum()

Empty datum, to be populated elsewhere.

bool is_value() const

True if Datum contains a scalar or array-like data.

ValueDescr descr() const

Return the shape (array or scalar) and type for supported kinds (ARRAY, CHUNKED_ARRAY, and SCALAR).

Debug asserts otherwise

ValueDescr::Shape shape() const

Return the shape (array or scalar) for supported kinds (ARRAY, CHUNKED_ARRAY, and SCALAR).

Debug asserts otherwise

std::shared_ptr<DataType> type() const

The value type of the variant, if any.

Return

nullptr if no type

std::shared_ptr<Schema> schema() const

The schema of the variant, if any.

Return

nullptr if no schema

int64_t length() const

The value length of the variant, if any.

Return

kUnknownLength if the variant has no length

ArrayVector chunks() const

The array chunks of the variant, if any.

Return

empty if not arraylike

Abstract Function classes

struct FunctionOptions
#include <arrow/compute/function.h>

Base class for specifying options configuring a function’s behavior, such as error handling.

Subclassed by arrow::compute::ArithmeticOptions, arrow::compute::ArraySortOptions, arrow::compute::CastOptions, arrow::compute::CompareOptions, arrow::compute::CountOptions, arrow::compute::DictionaryEncodeOptions, arrow::compute::FilterOptions, arrow::compute::MatchSubstringOptions, arrow::compute::MinMaxOptions, arrow::compute::ModeOptions, arrow::compute::PartitionNthOptions, arrow::compute::ProjectOptions, arrow::compute::QuantileOptions, arrow::compute::ReplaceSubstringOptions, arrow::compute::SetLookupOptions, arrow::compute::SortOptions, arrow::compute::SplitOptions, arrow::compute::StrptimeOptions, arrow::compute::TDigestOptions, arrow::compute::TakeOptions, arrow::compute::TrimOptions, arrow::compute::VarianceOptions

struct Arity
#include <arrow/compute/function.h>

Contains the number of required arguments for the function.

Naming conventions taken from https://en.wikipedia.org/wiki/Arity.

Public Members

int num_args

The number of required arguments (or the minimum number for varargs functions).

bool is_varargs = false

If true, num_args is the minimum number of required arguments rather than an exact count.

Public Static Functions

static Arity Nullary()

A function taking no arguments.

static Arity Unary()

A function taking 1 argument.

static Arity Binary()

A function taking 2 arguments.

static Arity Ternary()

A function taking 3 arguments.

static Arity VarArgs(int min_args = 0)

A function taking a variable number of arguments.

Parameters
  • [in] min_args: the minimum number of arguments required when invoking the function
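As a rough illustration of how num_args and is_varargs interact, the standalone sketch below mirrors the documented members in a hypothetical ToyArity type. It is not the Arrow header; the constructor and the Accepts helper are assumptions added for the example.

```cpp
#include <cstddef>

// Standalone model of the documented members; not arrow::compute::Arity itself.
struct ToyArity {
  int num_args;
  bool is_varargs = false;

  // Assumed constructor, added so the sketch compiles under older standards.
  ToyArity(int n, bool varargs) : num_args(n), is_varargs(varargs) {}

  static ToyArity Binary() { return ToyArity(2, false); }
  static ToyArity VarArgs(int min_args = 0) { return ToyArity(min_args, true); }

  // Hypothetical helper: true if a call with `count` arguments fits this arity.
  bool Accepts(std::size_t count) const {
    const std::size_t required = static_cast<std::size_t>(num_args);
    return is_varargs ? count >= required : count == required;
  }
};
```

With this model, a binary arity accepts exactly two arguments, while VarArgs(1) accepts one or more.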

struct FunctionDoc
#include <arrow/compute/function.h>

Public Members

std::string summary

A one-line summary of the function, using a verb.

For example, “Add two numeric arrays or scalars”.

std::string description

A detailed description of the function, meant to follow the summary.

std::vector<std::string> arg_names

Symbolic names (identifiers) for the function arguments.

Some bindings may use this to generate nicer function signatures.

std::string options_class

Name of the options class, if any.

class Function
#include <arrow/compute/function.h>

Base class for compute functions.

Function implementations contain a collection of “kernels” which are implementations of the function for specific argument types. Selecting a viable kernel for executing a function is referred to as “dispatching”.

Subclassed by arrow::compute::detail::FunctionImpl< VectorKernel >, arrow::compute::detail::FunctionImpl< ScalarKernel >, arrow::compute::detail::FunctionImpl< ScalarAggregateKernel >, arrow::compute::detail::FunctionImpl< HashAggregateKernel >, arrow::compute::MetaFunction, arrow::compute::detail::FunctionImpl< KernelType >

Public Types

enum Kind

The kind of function, which indicates in what contexts it is valid for use.

Values:

A function that performs scalar data operations on whole arrays of data.

Can generally process Array or Scalar values. The size of the output will be the same as the size (or broadcasted size, in the case of mixing Array and Scalar inputs) of the input.

A function with array input and output whose behavior depends on the values of the entire arrays passed, rather than the value of each scalar value.

A function that computes scalar summary statistics from array input.

A function that computes grouped summary statistics from array input and an array of group identifiers.

A function that dispatches to other functions and does not contain its own kernels.
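The difference between the first four kinds can be sketched with plain std::vector operations. The helper names below (AddBroadcast, ArgSort, Sum, GroupedSum) are hypothetical and do not use Arrow types; they only illustrate the shape of input and output each kind implies.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar-style: elementwise, output length equals input length.
// The single value `rhs` is broadcast across the array.
std::vector<int> AddBroadcast(const std::vector<int>& lhs, int rhs) {
  std::vector<int> out(lhs.size());
  std::transform(lhs.begin(), lhs.end(), out.begin(),
                 [rhs](int v) { return v + rhs; });
  return out;
}

// Vector-style: each output slot depends on the whole input (a stable argsort).
std::vector<std::size_t> ArgSort(const std::vector<int>& values) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(),
                   [&values](std::size_t a, std::size_t b) {
                     return values[a] < values[b];
                   });
  return idx;
}

// Scalar-aggregate-style: reduces an array to one summary value.
long long Sum(const std::vector<int>& values) {
  return std::accumulate(values.begin(), values.end(), 0LL);
}

// Hash-aggregate-style: one summary value per group identifier.
// Assumes group ids are dense, in [0, num_groups), and match values in length.
std::vector<long long> GroupedSum(const std::vector<int>& values,
                                  const std::vector<std::size_t>& group_ids,
                                  std::size_t num_groups) {
  std::vector<long long> sums(num_groups, 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    sums[group_ids[i]] += values[i];
  }
  return sums;
}
```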

Public Functions

const std::string &name() const

The name of the function. The registry enforces uniqueness of names.

Function::Kind kind() const

The kind of function, which indicates in what contexts it is valid for use.

const Arity &arity() const

The function’s arity: the number of arguments it requires, and whether it accepts a variable number of arguments.

const FunctionDoc &doc() const

Return the function documentation.

virtual int num_kernels() const = 0

Returns the number of registered kernels for this function.

virtual Result<const Kernel *> DispatchExact(const std::vector<ValueDescr> &values) const

Return a kernel that can execute the function given the exact argument types (without implicit type casts or scalar->array promotions).

NB: This function is overridden in CastFunction.

virtual Result<const Kernel *> DispatchBest(std::vector<ValueDescr> *values) const

Return a best-match kernel that can execute the function given the argument types, after implicit casts are applied.

Parameters
  • [inout] values: Argument types. An element may be modified to indicate that the returned kernel only approximately matches the input value descriptors; callers are responsible for casting inputs to the type and shape required by the kernel.

virtual Result<Datum> Execute(const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx) const

Execute the function eagerly with the passed input arguments with kernel dispatch, batch iteration, and memory allocation details taken care of.

If the options pointer is null, then default_options() will be used.

This function can be overridden in subclasses.

const FunctionOptions *default_options() const

Returns the default options for this function.

Whatever option semantics a Function has, implementations must guarantee that default_options() is valid to pass to Execute as options.

class ScalarFunction : public arrow::compute::detail::FunctionImpl<ScalarKernel>
#include <arrow/compute/function.h>

A function that executes elementwise operations on arrays or scalars, and therefore whose results generally do not depend on the order of the values in the arguments.

Accepts and returns arrays that are all of the same size. These functions roughly correspond to the functions used in SQL expressions.

Subclassed by arrow::compute::CastFunction

Public Functions

Status AddKernel(std::vector<InputType> in_types, OutputType out_type, ArrayKernelExec exec, KernelInit init = NULLPTR)

Add a kernel with given input/output types, no required state initialization, preallocation for fixed-width types, and default null handling (intersect validity bitmaps of inputs).
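The phrase "intersect validity bitmaps of inputs" can be read as a bitwise AND: an output slot is valid only where every input slot is valid. The sketch below assumes packed bitmaps of equal byte length with the least-significant bit first; IntersectValidity and IsValid are hypothetical helper names, not Arrow functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise-AND two packed validity bitmaps (assumed to have equal byte length).
// A set bit means "valid"; an output slot is valid only if both inputs are.
std::vector<std::uint8_t> IntersectValidity(const std::vector<std::uint8_t>& a,
                                            const std::vector<std::uint8_t>& b) {
  std::vector<std::uint8_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = static_cast<std::uint8_t>(a[i] & b[i]);
  }
  return out;
}

// Read bit `i`, least-significant bit first within each byte.
bool IsValid(const std::vector<std::uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}
```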

Status AddKernel(ScalarKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class VectorFunction : public arrow::compute::detail::FunctionImpl<VectorKernel>
#include <arrow/compute/function.h>

A function that executes general array operations that may yield outputs of different sizes or have results that depend on the whole array contents.

These functions roughly correspond to the functions found in non-SQL array languages like APL and its derivatives.

Public Functions

Status AddKernel(std::vector<InputType> in_types, OutputType out_type, ArrayKernelExec exec, KernelInit init = NULLPTR)

Add a simple kernel with given input/output types, no required state initialization, no data preallocation, and no preallocation of the validity bitmap.

Status AddKernel(VectorKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class ScalarAggregateFunction : public arrow::compute::detail::FunctionImpl<ScalarAggregateKernel>
#include <arrow/compute/function.h>

Public Functions

Status AddKernel(ScalarAggregateKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class HashAggregateFunction : public arrow::compute::detail::FunctionImpl<HashAggregateKernel>
#include <arrow/compute/function.h>

Public Functions

Status AddKernel(HashAggregateKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class MetaFunction : public arrow::compute::Function
#include <arrow/compute/function.h>

A function that dispatches to other functions.

Must implement MetaFunction::ExecuteImpl.

For Array, ChunkedArray, and Scalar Datum kinds, may rely on the execution of concrete Function types, but must handle other Datum kinds on its own.

Public Functions

int num_kernels() const

Returns the number of registered kernels for this function.

Result<Datum> Execute(const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx) const

Execute the function eagerly with the passed input arguments with kernel dispatch, batch iteration, and memory allocation details taken care of.

If the options pointer is null, then default_options() will be used.

This function can be overridden in subclasses.

Function registry

class FunctionRegistry

A mutable central function registry for built-in functions as well as user-defined functions.

Functions are implementations of arrow::compute::Function.

Generally, each function contains kernels which are implementations of a function for a specific argument signature. After looking up a function in the registry, one can either execute it eagerly with Function::Execute or use one of the function’s dispatch methods to pick a suitable kernel for lower-level function execution.

Public Functions

Status AddFunction(std::shared_ptr<Function> function, bool allow_overwrite = false)

Add a new function to the registry.

Returns Status::KeyError if a function with the same name is already registered and allow_overwrite is false.

Status AddAlias(const std::string &target_name, const std::string &source_name)

Add an alias, target_name, for the function registered as source_name.

Returns Status::KeyError if no function named source_name is registered.

Result<std::shared_ptr<Function>> GetFunction(const std::string &name) const

Retrieve a function by name from the registry.

std::vector<std::string> GetFunctionNames() const

Return vector of all entry names in the registry.

Helpful for displaying a manifest of available functions

int num_functions() const

The number of currently registered functions.
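To make the name-keyed behavior concrete, here is a toy model of a registry built on std::unordered_map. ToyFunction and ToyRegistry are hypothetical stand-ins, and a bool return takes the place of Status; the real class may differ in details this sketch does not capture.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for arrow::compute::Function; only the name matters here.
struct ToyFunction {
  std::string name;
};

class ToyRegistry {
 public:
  // Returns false (standing in for Status::KeyError) on a name clash,
  // unless allow_overwrite is set.
  bool AddFunction(std::shared_ptr<ToyFunction> fn, bool allow_overwrite = false) {
    const std::string name = fn->name;
    if (functions_.count(name) != 0 && !allow_overwrite) return false;
    functions_[name] = std::move(fn);
    return true;
  }

  // Registers target_name as another name for the function stored under source_name.
  bool AddAlias(const std::string& target_name, const std::string& source_name) {
    auto it = functions_.find(source_name);
    if (it == functions_.end()) return false;
    std::shared_ptr<ToyFunction> fn = it->second;  // copy before inserting
    functions_[target_name] = std::move(fn);
    return true;
  }

  // Returns nullptr when the name is unknown.
  std::shared_ptr<ToyFunction> GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? nullptr : it->second;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<ToyFunction>> functions_;
};
```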

Public Static Functions

static std::unique_ptr<FunctionRegistry> Make()

Construct a new registry.

Most users only need to use the global registry

FunctionRegistry *arrow::compute::GetFunctionRegistry()

Return the process-global function registry.

Convenience functions

Result<Datum> arrow::compute::CallFunction(const std::string &func_name, const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx = NULLPTR)

One-shot invoker for all types of functions.

Does kernel dispatch, argument checking, iteration of ChunkedArray inputs, and wrapping of outputs.

Result<Datum> arrow::compute::CallFunction(const std::string &func_name, const std::vector<Datum> &args, ExecContext *ctx = NULLPTR)

Variant of CallFunction which uses a function’s default options.

NB: Some functions require FunctionOptions be provided.

Concrete options classes

enum CompareOperator

Values:

enum SortOrder

Values:

struct CountOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Count kernel behavior.

By default, all non-null values are counted.

Public Types

enum Mode

Values:

Count all non-null values.

Count all null values.

Public Functions

CountOptions(enum Mode count_mode = COUNT_NON_NULL)

Public Members

Mode count_mode

Public Static Functions

static CountOptions Defaults()

struct MinMaxOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control MinMax kernel behavior.

By default, null values are ignored.

Public Types

enum Mode

Values:

Skip null values.

Any nulls will result in null output.

Public Functions

MinMaxOptions(enum Mode null_handling = SKIP)

Public Members

Mode null_handling

Public Static Functions

static MinMaxOptions Defaults()

struct ModeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Mode kernel behavior.

Returns top-n common values and counts. By default, returns the most common value and count.
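A minimal sketch of a top-n mode computation follows, using std::map for counting. TopNModes and ValueCount are hypothetical names, and the tie-breaking rule (smaller value first among equal counts) is an assumption of this example rather than something stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using ValueCount = std::pair<int, std::int64_t>;  // (value, count)

// Return up to `n` (value, count) pairs, most frequent first.
// Assumption: ties are broken in favor of the smaller value.
std::vector<ValueCount> TopNModes(const std::vector<int>& values, std::int64_t n = 1) {
  std::map<int, std::int64_t> counts;
  for (int v : values) ++counts[v];

  // std::map iterates in ascending value order, so a stable sort by count
  // keeps equal-count entries in ascending value order.
  std::vector<ValueCount> out(counts.begin(), counts.end());
  std::stable_sort(out.begin(), out.end(),
                   [](const ValueCount& a, const ValueCount& b) {
                     return a.second > b.second;
                   });

  if (n < 0) n = 0;
  if (out.size() > static_cast<std::size_t>(n)) {
    out.resize(static_cast<std::size_t>(n));
  }
  return out;
}
```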

Public Functions

ModeOptions(int64_t n = 1)

Public Members

int64_t n = 1

Public Static Functions

static ModeOptions Defaults()

struct VarianceOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Delta Degrees of Freedom (ddof) of the Variance and Stddev kernels.

The divisor used in calculations is N - ddof, where N is the number of elements. By default, ddof is zero, and population variance or stddev is returned.

Public Functions

VarianceOptions(int ddof = 0)

Public Members

int ddof = 0

Public Static Functions

static VarianceOptions Defaults()

struct QuantileOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Quantile kernel behavior.

By default, returns the median value.

Public Types

enum Interpolation

Interpolation method to use when quantile lies between two data points.

Values:

Public Functions

QuantileOptions(double q = 0.5, enum Interpolation interpolation = LINEAR)
QuantileOptions(std::vector<double> q, enum Interpolation interpolation = LINEAR)

Public Members

std::vector<double> q

quantile must be between 0 and 1 inclusive
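As an illustration of linear interpolation between two data points, the sketch below assumes the common convention that quantile q falls at fractional position q * (N - 1) in the sorted data. That convention, and the QuantileLinear name, are assumptions of this example and are not taken from the text above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile with linear interpolation between the two nearest sorted points.
// Assumes: values is non-empty, 0 <= q <= 1, and position = q * (N - 1).
double QuantileLinear(std::vector<double> values, double q) {
  std::sort(values.begin(), values.end());
  const double pos = q * static_cast<double>(values.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
  const std::size_t hi = std::min(lo + 1, values.size() - 1);
  const double frac = pos - static_cast<double>(lo);
  return values[lo] + frac * (values[hi] - values[lo]);
}
```

With q = 0.5 this yields the median: for an even number of points it is the midpoint of the two central values.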

Interpolation interpolation

Public Static Functions

static QuantileOptions Defaults()

struct TDigestOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control TDigest approximate quantile kernel behavior.

By default, returns the median value.

Public Functions

TDigestOptions(double q = 0.5, uint32_t delta = 100, uint32_t buffer_size = 500)
TDigestOptions(std::vector<double> q, uint32_t delta = 100, uint32_t buffer_size = 500)

Public Members

std::vector<double> q

quantile must be between 0 and 1 inclusive

uint32_t delta

compression parameter, default 100

uint32_t buffer_size

input buffer size, default 500

Public Static Functions

static TDigestOptions Defaults()

struct ArithmeticOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

ArithmeticOptions()

Public Members

bool check_overflow

struct MatchSubstringOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

MatchSubstringOptions(std::string pattern)

Public Members

std::string pattern

The exact substring (or regex, depending on kernel) to look for inside input values.

struct SplitOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Subclassed by arrow::compute::SplitPatternOptions

Public Functions

SplitOptions(int64_t max_splits = -1, bool reverse = false)

Public Members

int64_t max_splits

Maximum number of splits allowed, or unlimited when -1.

bool reverse

Start splitting from the end of the string (only relevant when max_splits != -1)

struct SplitPatternOptions : public arrow::compute::SplitOptions
#include <arrow/compute/api_scalar.h>

Public Functions

SplitPatternOptions(std::string pattern, int64_t max_splits = -1, bool reverse = false)

Public Members

std::string pattern

The exact substring to look for inside input values.
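The interaction of pattern, max_splits, and reverse can be sketched on std::string. SplitPattern below is a hypothetical helper that assumes a non-empty pattern and treats the string as raw bytes; it is meant to show the documented semantics, not the kernel's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Split `s` on the exact substring `pattern` (assumed non-empty).
// max_splits == -1 means unlimited; `reverse` starts from the end of the string.
std::vector<std::string> SplitPattern(const std::string& s, const std::string& pattern,
                                      std::int64_t max_splits = -1, bool reverse = false) {
  std::vector<std::string> parts;
  std::int64_t splits = 0;
  if (!reverse) {
    std::size_t start = 0;
    while (max_splits < 0 || splits < max_splits) {
      const std::size_t pos = s.find(pattern, start);
      if (pos == std::string::npos) break;
      parts.push_back(s.substr(start, pos - start));
      start = pos + pattern.size();
      ++splits;
    }
    parts.push_back(s.substr(start));
  } else {
    std::size_t end = s.size();  // exclusive end of the unsplit prefix
    while (max_splits < 0 || splits < max_splits) {
      if (end < pattern.size()) break;
      const std::size_t pos = s.rfind(pattern, end - pattern.size());
      if (pos == std::string::npos) break;
      parts.push_back(s.substr(pos + pattern.size(), end - pos - pattern.size()));
      end = pos;
      ++splits;
    }
    parts.push_back(s.substr(0, end));
    std::reverse(parts.begin(), parts.end());  // restore left-to-right order
  }
  return parts;
}
```

With max_splits = 1, splitting "a,b,c" on "," keeps the remainder on the right when going forward and on the left when reverse is set.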

struct ReplaceSubstringOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

ReplaceSubstringOptions(std::string pattern, std::string replacement, int64_t max_replacements = -1)

Public Members

std::string pattern

Pattern to match, literal, or regular expression depending on which kernel is used.

std::string replacement

String to replace the pattern with.

int64_t max_replacements

Max number of substrings to replace (-1 means unbounded)
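For the literal (non-regex) case, the role of max_replacements can be sketched as follows. ReplaceLiteral is a hypothetical helper that assumes a non-empty pattern and scans left to right.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Replace up to `max_replacements` literal occurrences of `pattern`
// (assumed non-empty), scanning left to right; -1 means unbounded.
std::string ReplaceLiteral(const std::string& input, const std::string& pattern,
                           const std::string& replacement,
                           std::int64_t max_replacements = -1) {
  std::string out;
  std::size_t start = 0;
  std::int64_t done = 0;
  while (max_replacements < 0 || done < max_replacements) {
    const std::size_t pos = input.find(pattern, start);
    if (pos == std::string::npos) break;
    out.append(input, start, pos - start);  // text before the match
    out += replacement;
    start = pos + pattern.size();
    ++done;
  }
  out.append(input, start, std::string::npos);  // untouched tail
  return out;
}
```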

struct SetLookupOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Options for IsIn and IndexIn functions.

Public Functions

SetLookupOptions(Datum value_set, bool skip_nulls = false)

Public Members

Datum value_set

The set of values in which to look up input values.

bool skip_nulls

Whether nulls in value_set count for lookup.

If true, any null in value_set is ignored and nulls in the input produce null (IndexIn) or false (IsIn) values in the output. If false, any null in value_set is successfully matched in the input.
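The IsIn half of this rule can be modeled with a small nullable type. NullableInt and the IsIn helper below are hypothetical; the sketch follows the description above and additionally assumes that a null input yields false when value_set contains no null.

```cpp
#include <cstddef>
#include <vector>

// valid == false models a null slot; `value` is then ignored.
struct NullableInt {
  bool valid;
  int value;
};

// Model of IsIn: the output is true where the input slot matches value_set.
std::vector<bool> IsIn(const std::vector<NullableInt>& input,
                       const std::vector<NullableInt>& value_set,
                       bool skip_nulls = false) {
  bool set_has_null = false;
  for (const NullableInt& s : value_set) {
    if (!s.valid) set_has_null = true;
  }

  std::vector<bool> out;
  out.reserve(input.size());
  for (const NullableInt& v : input) {
    if (!v.valid) {
      // A null input matches only when nulls in value_set are honored.
      out.push_back(!skip_nulls && set_has_null);
      continue;
    }
    bool found = false;
    for (const NullableInt& s : value_set) {
      if (s.valid && s.value == v.value) found = true;
    }
    out.push_back(found);
  }
  return out;
}
```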

struct StrptimeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

StrptimeOptions(std::string format, TimeUnit::type unit)

Public Members

std::string format
TimeUnit::type unit

struct TrimOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

TrimOptions(std::string characters)

Public Members

std::string characters

The individual characters that can be trimmed from the string.
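Note that characters is a set of individual characters, not a prefix or suffix to match as a whole. A byte-oriented sketch using std::string (the TrimChars name is hypothetical, and multi-byte encodings are ignored here) might look like this:

```cpp
#include <cstddef>
#include <string>

// Strip any of `characters` from both ends of `input`, treating each byte
// as one character.
std::string TrimChars(const std::string& input, const std::string& characters) {
  const std::size_t first = input.find_first_not_of(characters);
  if (first == std::string::npos) return "";  // everything was trimmable
  const std::size_t last = input.find_last_not_of(characters);
  return input.substr(first, last - first + 1);
}
```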

struct CompareOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

CompareOptions(CompareOperator op)

Public Members

CompareOperator op

struct ProjectOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

ProjectOptions(std::vector<std::string> n, std::vector<bool> r, std::vector<std::shared_ptr<const KeyValueMetadata>> m)
ProjectOptions(std::vector<std::string> n)

Public Members

std::vector<std::string> field_names

Names for wrapped columns.

std::vector<bool> field_nullability

Nullability bits for wrapped columns.

std::vector<std::shared_ptr<const KeyValueMetadata>> field_metadata

Metadata attached to wrapped columns.

struct FilterOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Types

enum NullSelectionBehavior

Configure the action taken when a slot of the selection mask is null.

Values:

the corresponding filtered value will be removed in the output

the corresponding filtered value will be null in the output
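The two behaviors can be contrasted with a three-state selection mask. All names in this sketch (Slot, MaskSlot, NullSelection, FilterValues) are hypothetical stand-ins and not Arrow types.

```cpp
#include <cstddef>
#include <vector>

struct Slot {
  bool valid;  // false models a null
  int value;
};
enum class MaskSlot { kFalse, kTrue, kNull };   // three-state selection mask
enum class NullSelection { kDrop, kEmitNull };  // hypothetical names for the two behaviors

// Keep values where the mask is true; a null mask slot either drops the
// value or emits a null in its place, depending on `behavior`.
std::vector<Slot> FilterValues(const std::vector<Slot>& values,
                               const std::vector<MaskSlot>& mask,
                               NullSelection behavior = NullSelection::kDrop) {
  std::vector<Slot> out;
  for (std::size_t i = 0; i < values.size() && i < mask.size(); ++i) {
    if (mask[i] == MaskSlot::kTrue) {
      out.push_back(values[i]);
    } else if (mask[i] == MaskSlot::kNull && behavior == NullSelection::kEmitNull) {
      out.push_back(Slot{false, 0});
    }
  }
  return out;
}
```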

Public Functions

FilterOptions(NullSelectionBehavior null_selection = DROP)

Public Members

NullSelectionBehavior null_selection_behavior = DROP

Public Static Functions

static FilterOptions Defaults()

struct TakeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

TakeOptions(bool boundscheck = true)

Public Members

bool boundscheck = true

Public Static Functions

static TakeOptions BoundsCheck()
static TakeOptions NoBoundsCheck()
static TakeOptions Defaults()

struct DictionaryEncodeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Options for the dictionary encode function.

Public Types

enum NullEncodingBehavior

Configure how null values will be encoded.

Values:

the null value will be added to the dictionary with a proper index

the null value will be masked in the indices array

Public Functions

DictionaryEncodeOptions(NullEncodingBehavior null_encoding = MASK)

Public Members

NullEncodingBehavior null_encoding_behavior = MASK

Public Static Functions

static DictionaryEncodeOptions Defaults()

struct SortKey
#include <arrow/compute/api_vector.h>

One sort key for PartitionNthIndices (TODO) and SortIndices.

Public Functions

SortKey(std::string name, SortOrder order = SortOrder::Ascending)

Public Members

std::string name

The name of the sort column.

SortOrder order

How to order by this sort key.

struct ArraySortOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

ArraySortOptions(SortOrder order = SortOrder::Ascending)

Public Members

SortOrder order

Public Static Functions

static ArraySortOptions Defaults()

struct SortOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

SortOptions(std::vector<SortKey> sort_keys = {})

Public Members

std::vector<SortKey> sort_keys

Public Static Functions

static SortOptions Defaults()

struct PartitionNthOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Partitioning options for NthToIndices.

Public Functions

PartitionNthOptions(int64_t pivot)

Public Members

int64_t pivot

The index into the equivalent sorted array of the partition pivot element.
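The meaning of pivot is close to that of std::nth_element: after partitioning, position pivot holds the element a full sort would place there, with no element before it greater and no element after it smaller. The sketch below applies that idea to an index array; PartitionIndices is a hypothetical helper, and leaving the indices untouched for an out-of-range pivot is an assumption of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Return indices into `values`, partially ordered around position `pivot`:
// the index at `pivot` refers to the element a full sort would put there.
std::vector<std::size_t> PartitionIndices(const std::vector<int>& values,
                                          std::int64_t pivot) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  if (pivot >= 0 && static_cast<std::size_t>(pivot) < idx.size()) {
    std::nth_element(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(pivot),
                     idx.end(), [&values](std::size_t a, std::size_t b) {
                       return values[a] < values[b];
                     });
  }
  return idx;
}
```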

struct CastOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/cast.h>

Public Functions

CastOptions(bool safe = true)

Public Members

std::shared_ptr<DataType> to_type
bool allow_int_overflow
bool allow_time_truncate
bool allow_time_overflow
bool allow_decimal_truncate
bool allow_float_truncate
bool allow_invalid_utf8

Public Static Functions

static CastOptions Safe(std::shared_ptr<DataType> to_type = NULLPTR)
static CastOptions Unsafe(std::shared_ptr<DataType> to_type = NULLPTR)