descriptive_analysis.py

FUNCTION DESCRIPTION
descriptive_statistics

Generate descriptive statistics.

descriptive_statistics

descriptive_statistics(df, group_by_col=None)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes the "pleasantness_score" and "presence_score" columns of the DataFrame, which are continuous numerical variables in the range [-1, 1].
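As a minimal sketch of how the rows beyond the standard pandas `describe()` output can be derived, the snippet below uses hypothetical toy values (not the package's example data) and only standard pandas calls. Note that pandas reports sample variance (`ddof=1`) and excess (Fisher) kurtosis, for which a normal distribution gives 0.

```python
import pandas as pd

# Hypothetical toy data; real scores come from the validated survey DataFrame.
df = pd.DataFrame({
    "pleasantness_score": [-0.35, 0.0, 0.10, 0.18, 0.57],
    "presence_score": [-0.35, -0.09, 0.0, 0.13, 0.28],
})

cols = df[["pleasantness_score", "presence_score"]]
s = cols.describe()                    # count, mean, std, min, quartiles, max
s.loc["median"] = cols.median()        # same value as the "50%" row
s.loc["variance"] = cols.var()         # sample variance (ddof=1), i.e. std squared
s.loc["skewness"] = cols.skew()        # bias-corrected sample skewness
s.loc["kurtosis"] = cols.kurtosis()    # excess (Fisher) kurtosis
```

The "median" row therefore repeats the "50%" row, and "variance" is the square of "std"; they are kept as separate rows for convenience.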

PARAMETER DESCRIPTION
df

A DataFrame containing "pleasantness_score" and "presence_score".

TYPE: DataFrame

group_by_col

Name of the column in df to be used as the categorical grouping variable. If None or not present in df, the statistics are computed over the whole DataFrame.

TYPE: str or None, optional

RETURNS DESCRIPTION
s

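Summary statistics of the DataFrame provided. The same table is printed and written to "descriptive_statistics.csv" in the current working directory.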

TYPE: DataFrame

Examples:

>>> from smellscapy.databases.DataExample import load_example_data
>>> from smellscapy.surveys import validate
>>> df = load_example_data()
>>> df,_ = validate(df)
>>> s = descriptive_statistics(df)
        pleasantness_score  presence_score
count              39.000000       39.000000
mean                0.090199       -0.001022
std                 0.164035        0.177936
min                -0.353553       -0.353553
25%                 0.000000       -0.088388
50%                 0.103553        0.000000
75%                 0.176777        0.133883
max                 0.573223        0.280330
median              0.103553        0.000000
variance            0.026907        0.031661
skewness            0.056743       -0.453332
kurtosis            1.584875       -0.471179
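When a grouping column is supplied, the source listing below builds a long-format table with one block of statistics per subgroup, labelled by a "type" column (the statistic name) and a "subgroup" column (the group label). The following is a self-contained sketch of that layout using hypothetical toy data and an illustrative grouping column named "site"; it mirrors the logic of the listing rather than calling the package itself.

```python
import pandas as pd

# Hypothetical toy data with an illustrative grouping column named "site".
df = pd.DataFrame({
    "site": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "pleasantness_score": [0.1, 0.2, 0.0, 0.3, -0.1, 0.4, 0.2, 0.0],
    "presence_score": [0.0, -0.1, 0.1, 0.2, 0.3, -0.2, 0.1, 0.0],
})

score_cols = ["pleasantness_score", "presence_score"]
frames = []
for name, subgroup in df.groupby("site"):
    scores = subgroup[score_cols]
    stats = scores.describe()
    stats.loc["median"] = scores.median()
    stats.loc["variance"] = scores.var()
    stats.loc["skewness"] = scores.skew()
    stats.loc["kurtosis"] = scores.kurtosis()
    stats["type"] = stats.index    # statistic name, e.g. "mean"
    stats["subgroup"] = name       # group label, e.g. "A"
    frames.append(stats)

# Stack the per-group blocks into one long-format table.
s = pd.concat(frames, ignore_index=True)

# Look up a single statistic across all groups.
means = s[s["type"] == "mean"].set_index("subgroup")[score_cols]
```

Filtering on "type" and indexing by "subgroup", as in the last line, is one way to compare a single statistic across groups.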
Source code in src/smellscapy/analysis/descriptive_analysis.py
import pandas as pd


def descriptive_statistics(df, group_by_col=None):
    """
    Generate descriptive statistics.

    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.

    Analyzes the "pleasantness_score" and "presence_score" columns of the
    DataFrame, which are continuous numerical variables in the range [-1, 1].

    Parameters
    ----------
    df : pd.DataFrame
        A DataFrame containing "pleasantness_score" and "presence_score".

    group_by_col : str or None, optional
        Name of the column in ``df`` to be used as the categorical grouping
        variable. If None or not present in ``df``, the statistics are
        computed over the whole DataFrame.

    Returns
    -------
    s : DataFrame
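        Summary statistics of the DataFrame provided. The same table is
        printed and written to ``descriptive_statistics.csv`` in the current
        working directory.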


    Examples
    --------

    >>> from smellscapy.databases.DataExample import load_example_data
    >>> from smellscapy.surveys import validate
    >>> df = load_example_data()
    >>> df,_ = validate(df)
    >>> s = descriptive_statistics(df)
            pleasantness_score  presence_score
    count              39.000000       39.000000
    mean                0.090199       -0.001022
    std                 0.164035        0.177936
    min                -0.353553       -0.353553
    25%                 0.000000       -0.088388
    50%                 0.103553        0.000000
    75%                 0.176777        0.133883
    max                 0.573223        0.280330
    median              0.103553        0.000000
    variance            0.026907        0.031661
    skewness            0.056743       -0.453332
    kurtosis            1.584875       -0.471179   


    """


    if group_by_col is not None and group_by_col in df.columns:
        # Build one block of statistics per subgroup, then stack them in long format.
        columns = ["type", 'pleasantness_score', 'presence_score', "subgroup"]
        frames = []
        for name, subgroup in df.groupby(group_by_col):
            df_temp = subgroup[['pleasantness_score', 'presence_score']]
            s1 = df_temp.describe()
            s1.loc['median'] = df_temp.median()
            s1.loc['variance'] = df_temp.var()
            s1.loc['skewness'] = df_temp.skew()
            s1.loc['kurtosis'] = df_temp.kurtosis()
            s1["type"] = s1.index      # statistic name, e.g. "mean"
            s1["subgroup"] = name      # label of the current subgroup
            frames.append(s1)
        # Concatenate once at the end; fall back to an empty table if there are no groups.
        if frames:
            s = pd.concat(frames, ignore_index=True)[columns]
        else:
            s = pd.DataFrame(columns=columns)



    else:
        df_temp = df[['pleasantness_score', 'presence_score']]
        s = df_temp.describe()
        s.loc['median'] = df_temp.median()
        s.loc['variance'] = df_temp.var() 
        s.loc['skewness'] = df_temp.skew()
        s.loc['kurtosis'] = df_temp.kurtosis()

    print(s)

    s.to_csv("descriptive_statistics.csv")

    return s