Quickstart
YData SDK gives you an easy and familiar interface for adopting a Data-Centric AI approach to the development of Machine Learning solutions. Its features were designed to support structured data, including tabular, time-series and transactional data.
Read data
To start using the package features, load your data either through a Connector or as a pandas.DataFrame. The list of available connectors can be found here [add a link].
# Import paths below are assumed from the ydata.sdk package layout
from ydata.sdk.connectors import Connector
from ydata.sdk.datasources import DataSource
# Example for a Google Cloud Storage Connector
credentials = "{insert-credentials-file-path}"
# Create a new connector for Google Cloud Storage
connector = Connector(connector_type='gcs', credentials=credentials)
# Create a DataSource from the connector
# Note that a connector can be re-used for several datasources
X = DataSource(connector=connector, path='gs://<my_bucket>.csv')
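The pandas.DataFrame route mentioned above can be sketched as follows. This is a minimal illustration rather than the SDK's documented flow: the CSV content and its column names are hypothetical, and an in-memory string stands in for a real file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a local file;
# in practice you would pass a file path to pd.read_csv instead.
csv_text = "age,income\n34,52000\n51,61000\n28,\n"

# Load the data into a pandas.DataFrame
X = pd.read_csv(io.StringIO(csv_text))

# Inspect what was loaded: column types and missing values per column
print(X.dtypes)
print(X.isna().sum())
```

The resulting DataFrame plays the same role as the DataSource above, as the input handed to the synthesizer.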
The synthesis process returns a pandas.DataFrame object.
Note that if you are using the ydata-fabric-sdk free version, all of your data is sent to a remote cluster on YData's infrastructure.
Data synthesis flow
The process of data synthesis can be broken down into the following steps:
stateDiagram-v2
state read_data
read_data --> init_synth
init_synth --> train_synth
train_synth --> generate_samples
generate_samples --> [*]
The code snippet below shows how easy it is to start generating new synthetic data. The package includes a set of example datasets for a quick start.
from ydata.sdk.dataset import get_dataset
# Import path assumed from the ydata.sdk package layout
from ydata.sdk.synthesizers import RegularSynthesizer
# Read the example data
X = get_dataset('census')
# Init a synthesizer
synth = RegularSynthesizer()
# Fit the synthesizer to the input data
synth.fit(X)
# Sample new synthetic data. The request below asks for 1000 new synthetic rows
sample = synth.sample(n_samples=1000)
Do I need to prepare my data before synthesis?
The SDK ensures that the behaviour of the original data is replicated in the synthetic data. For that reason, there is no need to preprocess outliers or missing data before synthesis.
By default, all missing data is replicated as NaN.
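Since missing values are carried over as NaN, one quick sanity check is to compare the share of missing values per column in the original and synthetic data. The sketch below uses plain pandas only. The two small DataFrames are hypothetical stand-ins for your input data and for the output of the sampling step, and `missing_share` is an illustrative helper, not part of the SDK.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing (NaN) values in each column."""
    return df.isna().mean()


# Hypothetical stand-ins for the original data and a synthetic sample
original = pd.DataFrame({
    "age": [34, None, 51, 28],
    "income": [52000.0, 61000.0, None, None],
})
synthetic = pd.DataFrame({
    "age": [30, 47, None, 39],
    "income": [None, 58000.0, None, 49000.0],
})

# Side-by-side comparison of the missing-value share per column
comparison = pd.DataFrame({
    "original": missing_share(original),
    "synthetic": missing_share(synthetic),
})
print(comparison)
```

Similar shares in both columns suggest the missingness pattern was preserved. On real data the numbers will rarely match exactly, so treat this as a rough check.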