Skip to content

surveys.py

FUNCTION DESCRIPTION
validate

Validate a survey DataFrame for required columns and acceptable value ranges.

ATTRIBUTE DESCRIPTION
ATTRIBUTES_COLUMN_NAMES

List of allowed column names for survey data.

ATTRIBUTES_VALUES

Allowed inclusive range for attribute values.

ATTRIBUTES_COLUMN_NAMES module-attribute

ATTRIBUTES_COLUMN_NAMES = ['pleasant', 'present', 'light', 'engaging', 'unpleasant', 'absent', 'overpowering', 'detached']

List of allowed column names for survey data. The survey DataFrame must include these columns.

ATTRIBUTES_VALUES module-attribute

ATTRIBUTES_VALUES = [1, 2, 3, 4, 5]

Allowed inclusive range for attribute values. Values can range from 1 (minimum) to 5 (maximum).

validate

validate(df)

Validate a survey DataFrame for required columns and acceptable value ranges.

This function ensures that the provided DataFrame contains all necessary survey and ID columns, and that all numeric responses fall within the defined set ATTRIBUTES_VALUES.

PARAMETER DESCRIPTION
df

A DataFrame containing survey data. Must include columns listed in ATTRIBUTES_COLUMN_NAMES.

TYPE: DataFrame

RETURNS DESCRIPTION
df

A DataFrame containing only valid rows with complete data.

TYPE: DataFrame

excl_df

A DataFrame containing rows removed due to missing values.

TYPE: DataFrame

RAISES DESCRIPTION
Exception "Missing mandatory column/s"

If any required columns are missing

Exception "Attribute values are not valid"

If any numeric responses fall outside the allowed range.

Examples:

>>> import pandas as pd
>>> from smellscapy.databases.DataExample import load_example_data
>>> from smellscapy.surveys import validate
>>> df = load_example_data()
>>> df, excl_df = validate(df) # passes without error
Source code in src/smellscapy/surveys.py
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
def validate(
    df: pd.DataFrame,
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """ 
    Validate a survey DataFrame for required columns and acceptable value ranges.

    This function ensures that the provided DataFrame contains all necessary
    survey and ID columns, and that all numeric responses fall within the
    defined set `ATTRIBUTES_VALUES`.

    Parameters
    ----------
    df : pandas.DataFrame
        A DataFrame containing survey data. Must include columns listed in
        `ATTRIBUTES_COLUMN_NAMES`.

    Returns
    -------
    df : pandas.DataFrame
        A DataFrame containing only valid rows with complete data.
    excl_df : pandas.DataFrame
        A DataFrame containing rows removed due to missing values.

    Raises
    ------
    Exception "Missing mandatory column/s"
        If any required columns are missing

    Exception "Attribute values are not valid" 
        If any numeric responses fall outside the allowed range.


    Examples
    --------

    >>> import pandas as pd
    >>> from smellscapy.databases.DataExample import load_example_data
    >>> from smellscapy.surveys import validate
    >>> df = load_example_data()
    >>> df, excl_df = validate(df) # passes without error

    """
    logger.info("Validating data...")

    df = df.copy()
    # verificare che chh,tte le colonne 
    missing_columns = []
    for col in ATTRIBUTES_COLUMN_NAMES:
        if not col in df.columns:
            missing_columns.append(col)

    if missing_columns:
        raise Exception(f"Missing mandatory column/s: {', '.join(missing_columns)}")


    df_attr = df[ATTRIBUTES_COLUMN_NAMES]
    df_err = df_attr.applymap(lambda x: x in ATTRIBUTES_VALUES if pd.notna(x) else True)
    if any(df_err.eq(False).any()):
        raise Exception("Attribute values are not valid. Please use numbers in range [1, 2, 3, 4, 5]")

    # verificare attributes
    invalid_indices = []
    for i, row in df_attr.iterrows():
        if  row.isna().any():
            invalid_indices.append(i)

    if invalid_indices:
        excl_df = df.iloc[invalid_indices]
        df = df.drop(df.index[invalid_indices])
        logger.info(f"Found {len(invalid_indices)} samples with missing data")
        logger.info(f"Removed {len(invalid_indices)} rows with invalid data")
    else:
        excl_df = None
        logger.info("All data passed quality checks")

    return df, excl_df