Synthesize time-series data
Use YData's TimeSeriesSynthesizer to generate synthetic time-series data
Tabular data is the most common type of data we encounter in data problems.
When thinking about tabular data, we usually assume that records are independent of one another, but this assumption often fails in practice. Consider events from our day-to-day life, such as room temperature changes, bank account transactions, stock price fluctuations, and air quality measurements in our neighborhood: they produce datasets in which measurements and records evolve and are related through time. This type of data is known as sequential or time-series data.
Sequential or time-series data therefore refers to any data whose elements are ordered into sequences in a structured format. Within a time-series dataset, variables behave differently, and these differences need to be understood to generate synthetic data effectively. Typically, a time-series dataset is composed of the following:
- Variables that define the order of time (either a single variable or a combination of several)
- Time-variant variables
- Variables that refer to entities (single or multiple entities)
- Variables that are attributes (those that don't depend on time but rather on the entity)
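To make the four roles above concrete, here is a minimal sketch using pandas. The dataset, its column names (`temperature`, `location`), and its values are hypothetical and chosen only for illustration; `time` and `entity_id` mirror the column names used later on this page. The heuristic shown (a column is an attribute if it takes a single value per entity) is an assumption for this toy case, not how the synthesizer itself classifies variables.

```python
import pandas as pd

# Hypothetical toy dataset: two sensors (entities), three timestamps each.
df = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),  # time order
    "entity_id": ["sensor_a"] * 3 + ["sensor_b"] * 3,                        # entity
    "temperature": [20.1, 20.4, 19.8, 25.0, 25.3, 24.9],                     # time-variant
    "location": ["kitchen"] * 3 + ["attic"] * 3,                             # attribute
})

# Columns that are neither the time key nor the entity key.
candidates = [c for c in df.columns if c not in ("time", "entity_id")]

# Count distinct values of each candidate column within each entity.
distinct = df.groupby("entity_id")[candidates].nunique()

# Illustrative heuristic: attributes stay constant within every entity.
attributes = [c for c in candidates if (distinct[c] == 1).all()]
time_variant = [c for c in candidates if c not in attributes]

print("attributes:", attributes)
print("time-variant:", time_variant)
```

In this toy dataset, `location` never changes for a given sensor, so it is treated as an attribute, while `temperature` changes over time within each sensor.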
For a more detailed tutorial, please check the YData Fabric Academy ydata-sdk notebooks.
Find an example below:
# -*- coding: utf-8 -*-
# Authentication
import os
from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import TimeSeriesSynthesizer
# Do not forget to add your token as an environment variable
os.environ["YDATA_TOKEN"] = '{insert-token}'
# Sampling an example dataset for a multi-entity & multivariate time-series dataset
# Generate the dataset
time_series_data = get_dataset('timeseries')
# Print the first few rows of the dataset
print(time_series_data.head())
# Train a Synthetic data generator
# From a pandas dataframe
# We initialize a time series synthesizer
# Until `fit` is called, the synthesizer exists only locally
synth = TimeSeriesSynthesizer(name='Time-series synth')
# We train the synthesizer on our dataset
# sortbykey -> variable that defines the time order of the sequence
# entities -> variable that identifies the entity each record refers to
synth.fit(time_series_data, sortbykey='time', entities='entity_id')
# Generate samples from an already trained synthesizer
# From the synthesizer in context in the notebook
# Generate a sample with x number of entities
# In this example the objective is to generate a dataset with the same size as the original. For that reason, 5 entities will be generated.
sample = synth.sample(n_entities=5)
sample.head()
# From a previously trained synthetic data generation model
# List the trained synthetic data generators to get the synthesizer's uid
TimeSeriesSynthesizer.list()
synth = TimeSeriesSynthesizer(uid='{insert-synth-id}').get()
# Generate a new synthetic dataset with the sample method
sample = synth.sample(n_entities=5)
sample.head()
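After sampling, it can help to check the structure of the result. The sketch below is one possible sanity check, assuming the sample is available as a pandas DataFrame with the `time` and `entity_id` columns used above. The stand-in frame `sample_df`, its `value` column, and its contents are hypothetical so that the snippet runs without a token; replace it with the real sample. Whether the synthesizer returns rows already sorted by time is not stated here, so the check only reports what it finds.

```python
import pandas as pd

# Hypothetical stand-in for the synthetic sample; replace with the real output.
sample_df = pd.DataFrame({
    "time": [1, 2, 3, 1, 2, 3],
    "entity_id": ["e1", "e1", "e1", "e2", "e2", "e2"],
    "value": [0.5, 0.7, 0.6, 1.2, 1.1, 1.4],
})

# Number of distinct entities, to compare against the requested n_entities.
n_entities = sample_df["entity_id"].nunique()

# Number of records generated for each entity.
rows_per_entity = sample_df.groupby("entity_id").size()

# True only if, within every entity, the time column is non-decreasing.
time_is_ordered = (
    sample_df.groupby("entity_id")["time"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)

print("entities:", n_entities)
print(rows_per_entity)
print("time ordered within each entity:", bool(time_is_ordered))
```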