
📊 Data Analysis with Python

Summary: A comprehensive guide to data analysis with Python, from basics to advanced topics, covering pandas, NumPy, matplotlib, and other essential libraries.

🎯 Why Analyze Data?

Imagine you are a detective 🕵️‍♂️ and the data is your set of clues. Data analysis helps you:

  • Discover patterns - uncover hidden regularities
  • Make decisions - based on evidence rather than gut feeling
  • Predict the future - forecast trends
  • Optimize performance - improve efficiency

Python is an excellent tool for data analysis thanks to its rich ecosystem and simple syntax!


📦 1. ESSENTIAL LIBRARIES

Environment Setup

# Essential data science packages
pip install pandas numpy matplotlib seaborn scipy
pip install jupyter plotly scikit-learn

# Optional advanced packages
pip install statsmodels openpyxl xlsxwriter requests beautifulsoup4
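
After installing, a quick sanity check (a minimal sketch, not part of the original guide) is to import the core libraries and print their versions to confirm the environment is ready:

# Verify the core libraries are importable and show their versions
import pandas, numpy, matplotlib, seaborn, scipy

for lib in (pandas, numpy, matplotlib, seaborn, scipy):
    print(f"{lib.__name__}: {lib.__version__}")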

Import Statement Standards

# Standard imports for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Đã import thành công các thư viện cần thiết!")

🐼 2. PANDAS ESSENTIALS

DataFrame Basics

# Create a DataFrame from different sources
def create_sample_data():
    """Create sample student data."""

    # From a dictionary
    student_data = {
        'student_id': ['SV001', 'SV002', 'SV003', 'SV004', 'SV005'],
        'name': ['Nguyễn Văn An', 'Trần Thị Bình', 'Lê Hoàng Cường',
                 'Phạm Thị Dung', 'Hoàng Văn Em'],
        'age': [20, 19, 21, 20, 22],
        'gender': ['M', 'F', 'M', 'F', 'M'],
        'department': ['IT', 'Business', 'IT', 'Engineering', 'Business'],
        'gpa': [8.5, 9.2, 7.8, 8.9, 7.5],
        'credits': [45, 52, 38, 48, 41]
    }

    df = pd.DataFrame(student_data)

    # Add enrollment dates
    df['enrollment_date'] = pd.date_range('2020-09-01', periods=5, freq='30D')

    return df

# Create and explore the DataFrame
students_df = create_sample_data()

print("=== BASIC INFO ===")
print(f"Shape: {students_df.shape}")
print(f"Columns: {list(students_df.columns)}")
print("\n=== FIRST 3 ROWS ===")
print(students_df.head(3))

print("\n=== DATA TYPES ===")
print(students_df.dtypes)

print("\n=== BASIC STATISTICS ===")
print(students_df.describe())
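
Besides a dict of columns, two other common constructors (a brief sketch; the extra rows here are illustrative, not from the sample dataset) are a list of row dicts and a NumPy array:

# From a list of row dictionaries - one dict per row
rows = [
    {'student_id': 'SV006', 'name': 'Student F', 'gpa': 8.1},
    {'student_id': 'SV007', 'name': 'Student G', 'gpa': 7.2},
]
df_from_rows = pd.DataFrame(rows)
print(df_from_rows)

# From a NumPy array with explicit column names
arr = np.random.rand(3, 2)
df_from_array = pd.DataFrame(arr, columns=['feature_a', 'feature_b'])
print(df_from_array)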

Data Loading & Saving

def demonstrate_data_io():
    """Demo different ways to load and save data."""

    # 1. CSV Files
    # Save to CSV
    students_df.to_csv('students.csv', index=False, encoding='utf-8')

    # Load from CSV
    loaded_df = pd.read_csv('students.csv', encoding='utf-8')
    print("✅ CSV loaded successfully")

    # 2. Excel Files
    # Save to Excel with multiple sheets
    with pd.ExcelWriter('students_data.xlsx', engine='xlsxwriter') as writer:
        students_df.to_excel(writer, sheet_name='Students', index=False)

        # Create a summary sheet
        summary = students_df.groupby('department').agg({
            'gpa': ['mean', 'max', 'min'],
            'age': 'mean',
            'credits': 'sum'
        }).round(2)
        summary.to_excel(writer, sheet_name='Summary')

    # Load from Excel
    excel_df = pd.read_excel('students_data.xlsx', sheet_name='Students')
    print("✅ Excel loaded successfully")

    # 3. JSON Files
    # Save to JSON
    students_df.to_json('students.json', orient='records', indent=2)

    # Load from JSON
    json_df = pd.read_json('students.json')
    print("✅ JSON loaded successfully")

    # Clean up the temporary files
    import os
    for file in ['students.csv', 'students_data.xlsx', 'students.json']:
        if os.path.exists(file):
            os.remove(file)

# demonstrate_data_io()
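
The summary at the end of this guide also lists databases as a data source; a minimal sketch (not in the original demo) using the standard-library sqlite3 with an in-memory database, so nothing touches disk:

import sqlite3

def demonstrate_sql_io():
    """Sketch: round-trip a DataFrame through SQLite."""
    conn = sqlite3.connect(':memory:')  # throwaway in-memory database

    # Write the DataFrame to a SQL table
    students_df.to_sql('students', conn, index=False, if_exists='replace')

    # Read it back with an arbitrary SQL query
    sql_df = pd.read_sql('SELECT name, gpa FROM students WHERE gpa >= 8.0', conn)
    print(sql_df)

    conn.close()

# demonstrate_sql_io()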

Data Selection & Filtering

def data_selection_examples():
    """Different ways to select and filter data."""

    df = create_sample_data()

    print("=== COLUMN SELECTION ===")
    # Single column
    names = df['name']
    print(f"Names: {names.tolist()}")

    # Multiple columns
    basic_info = df[['name', 'department', 'gpa']]
    print("\nBasic info:")
    print(basic_info)

    print("\n=== ROW FILTERING ===")
    # Conditional filtering
    high_gpa = df[df['gpa'] >= 8.5]
    print("Students with GPA >= 8.5:")
    print(high_gpa[['name', 'gpa']])

    # Multiple conditions
    it_students = df[(df['department'] == 'IT') & (df['age'] >= 20)]
    print("\nIT students aged 20+:")
    print(it_students[['name', 'age', 'department']])

    # Using isin()
    target_depts = ['IT', 'Engineering']
    tech_students = df[df['department'].isin(target_depts)]
    print(f"\nStudents in {target_depts}:")
    print(tech_students[['name', 'department']])

    print("\n=== STRING FILTERING ===")
    # String contains
    students_with_nguyen = df[df['name'].str.contains('Nguyễn')]
    print("Students with 'Nguyễn' in name:")
    print(students_with_nguyen[['name']])

    # Filter by the middle name 'Thị' (a common marker of female Vietnamese names)
    female_students = df[df['name'].str.contains('Thị')]
    print("\nFemale students (names containing 'Thị'):")
    print(female_students[['name', 'gender']])

# data_selection_examples()
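
Two selection tools the examples above skip are label/position indexing with .loc/.iloc and the query() shorthand; a brief sketch:

def indexing_examples():
    """Sketch: .loc, .iloc, and query() selection."""
    df = create_sample_data()

    # .loc selects by label: rows 0-2 (inclusive) and two named columns
    print(df.loc[0:2, ['name', 'gpa']])

    # .iloc selects by position: first three rows, first two columns
    print(df.iloc[:3, :2])

    # query() expresses filters as a string; @ references Python variables
    min_gpa = 8.0
    print(df.query('gpa >= @min_gpa and department == "IT"'))

# indexing_examples()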

Data Cleaning

def data_cleaning_examples():
    """Common data cleaning techniques."""

    # Create messy data
    messy_data = {
        'student_id': ['SV001', 'SV002', None, 'SV004', 'SV005', 'SV002'],  # duplicate and missing values
        'name': [' NGUYỄN VĂN AN ', 'trần thị bình', 'LÊ HOÀNG cường', None, 'Phạm Thị Dung', ' NGUYỄN VĂN AN '],
        'age': [20, 19, '21', 'invalid', 22, 20],  # mixed types
        'email': ['an@email.com', 'binh@email.com', 'cuong@email', '', 'dung@email.com', 'an@email.com'],
        'score': [85.5, 92.0, None, 78.5, None, 85.5]
    }

    messy_df = pd.DataFrame(messy_data)

    print("=== ORIGINAL MESSY DATA ===")
    print(messy_df)
    print(f"Shape: {messy_df.shape}")

    # 1. Handle missing values
    print("\n=== MISSING VALUES ===")
    print("Missing values per column:")
    print(messy_df.isnull().sum())

    # Fill missing values
    cleaned_df = messy_df.copy()

    # Fill missing student_id with a generated ID
    missing_ids = cleaned_df['student_id'].isnull()
    cleaned_df.loc[missing_ids, 'student_id'] = 'SV003'

    # Fill missing names
    cleaned_df['name'] = cleaned_df['name'].fillna('Unknown Student')

    # Fill missing scores with the median
    median_score = pd.to_numeric(cleaned_df['score'], errors='coerce').median()
    cleaned_df['score'] = cleaned_df['score'].fillna(median_score)

    print(f"Median score: {median_score}")

    # 2. Clean string data
    print("\n=== STRING CLEANING ===")
    # Strip whitespace and standardize case
    cleaned_df['name'] = cleaned_df['name'].str.strip().str.title()
    cleaned_df['email'] = cleaned_df['email'].str.strip().str.lower()

    # Handle invalid emails
    valid_email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    invalid_emails = ~cleaned_df['email'].str.match(valid_email_pattern, na=False)
    cleaned_df.loc[invalid_emails, 'email'] = None

    print("After email cleaning:")
    print(cleaned_df['email'].value_counts(dropna=False))

    # 3. Fix data types
    print("\n=== DATA TYPE CONVERSION ===")
    # Convert age to numeric, replacing invalid values with NaN
    cleaned_df['age'] = pd.to_numeric(cleaned_df['age'], errors='coerce')

    # Fill invalid ages with the median
    median_age = cleaned_df['age'].median()
    cleaned_df['age'] = cleaned_df['age'].fillna(median_age).astype(int)

    # Convert score to numeric
    cleaned_df['score'] = pd.to_numeric(cleaned_df['score'], errors='coerce')

    print("Data types after conversion:")
    print(cleaned_df.dtypes)

    # 4. Remove duplicates
    print("\n=== DUPLICATE REMOVAL ===")
    print(f"Duplicate student_ids found: {cleaned_df.duplicated(subset=['student_id']).sum()}")

    # Remove duplicates based on student_id
    cleaned_df = cleaned_df.drop_duplicates(subset=['student_id'], keep='first')

    print("=== FINAL CLEANED DATA ===")
    print(cleaned_df)
    print(f"Final shape: {cleaned_df.shape}")

    return cleaned_df

📊 3. DATA ANALYSIS TECHNIQUES

Descriptive Statistics

def descriptive_analysis():
    """Descriptive statistical analysis."""

    df = create_sample_data()

    print("=== BASIC STATISTICS ===")
    # Central tendency and spread
    gpa_stats = {
        'Mean': df['gpa'].mean(),
        'Median': df['gpa'].median(),
        'Mode': df['gpa'].mode().iloc[0] if not df['gpa'].mode().empty else 'No mode',
        'Standard Deviation': df['gpa'].std(),
        'Variance': df['gpa'].var()
    }

    for stat, value in gpa_stats.items():
        print(f"{stat}: {value:.2f}" if isinstance(value, (int, float)) else f"{stat}: {value}")

    print("\n=== QUANTILES ===")
    quantiles = df['gpa'].quantile([0.25, 0.5, 0.75])
    print(f"Q1 (25%): {quantiles[0.25]:.2f}")
    print(f"Q2 (50%): {quantiles[0.5]:.2f}")
    print(f"Q3 (75%): {quantiles[0.75]:.2f}")
    print(f"IQR: {quantiles[0.75] - quantiles[0.25]:.2f}")

    print("\n=== DISTRIBUTION ANALYSIS ===")
    # Skewness and kurtosis
    from scipy.stats import skew, kurtosis
    print(f"Skewness: {skew(df['gpa']):.3f}")
    print(f"Kurtosis: {kurtosis(df['gpa']):.3f}")

    print("\n=== CORRELATION ANALYSIS ===")
    # Correlation matrix for the numeric columns
    numeric_cols = ['age', 'gpa', 'credits']
    correlation_matrix = df[numeric_cols].corr()
    print("Correlation Matrix:")
    print(correlation_matrix.round(3))

# descriptive_analysis()

Group Analysis

def group_analysis():
    """Group-level analysis with GroupBy."""

    # Build a larger dataset
    np.random.seed(42)
    departments = ['IT', 'Business', 'Engineering', 'Science', 'Arts']
    genders = ['M', 'F']

    large_data = []
    for i in range(100):
        student = {
            'student_id': f'SV{i+1:03d}',
            'name': f'Student {i+1}',
            'department': np.random.choice(departments),
            'gender': np.random.choice(genders),
            'age': np.random.randint(18, 25),
            'gpa': np.random.normal(7.5, 1.5),  # normal distribution
            'credits': np.random.randint(30, 60),
            'semester': np.random.choice(['Fall', 'Spring', 'Summer'])
        }
        # Keep GPA inside the valid range
        student['gpa'] = max(0, min(10, student['gpa']))
        large_data.append(student)

    df = pd.DataFrame(large_data)

    print("=== GROUP BY DEPARTMENT ===")
    dept_summary = df.groupby('department').agg({
        'gpa': ['mean', 'std', 'min', 'max', 'count'],
        'age': 'mean',
        'credits': 'sum'
    }).round(2)
    print(dept_summary)

    print("\n=== GROUP BY MULTIPLE COLUMNS ===")
    gender_dept_summary = df.groupby(['department', 'gender']).agg({
        'gpa': 'mean',
        'student_id': 'count'  # count students
    }).round(2)
    gender_dept_summary.columns = ['avg_gpa', 'student_count']
    print(gender_dept_summary)

    print("\n=== ADVANCED GROUPBY ===")
    # Custom aggregation functions
    def gpa_range(series):
        return series.max() - series.min()

    def top_performer_count(series):
        return (series >= 8.5).sum()

    advanced_summary = df.groupby('department').agg({
        'gpa': ['mean', gpa_range, top_performer_count],
        'age': lambda x: x.mode().iloc[0]  # most common age
    }).round(2)

    print("Advanced aggregations:")
    print(advanced_summary)

    print("\n=== PIVOT TABLES ===")
    # Pivot table - Excel-like functionality
    pivot_table = pd.pivot_table(
        df,
        values='gpa',
        index='department',
        columns='gender',
        aggfunc=['mean', 'count'],
        fill_value=0
    ).round(2)

    print("Pivot table - GPA by Department and Gender:")
    print(pivot_table)

    return df

# large_df = group_analysis()
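
agg() collapses each group to one row; its companion transform() returns a result aligned to the original rows, which is handy for within-group features. A short sketch (the new column names are illustrative):

def transform_example():
    """Sketch: row-aligned group statistics with transform()."""
    df = group_analysis()  # reuse the 100-student dataset

    # Broadcast each department's mean GPA back onto every student row
    df['dept_avg_gpa'] = df.groupby('department')['gpa'].transform('mean')

    # Z-score of each student's GPA within their own department
    df['gpa_z_in_dept'] = df.groupby('department')['gpa'].transform(
        lambda s: (s - s.mean()) / s.std()
    )

    print(df[['department', 'gpa', 'dept_avg_gpa', 'gpa_z_in_dept']].head())

# transform_example()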

Time Series Analysis Basics

def time_series_basics():
    """Time series analysis basics."""

    # Create time series data
    dates = pd.date_range('2023-01-01', periods=365, freq='D')
    np.random.seed(42)

    # Simulate student enrollment over time
    base_enrollment = 100
    trend = np.linspace(0, 50, 365)  # increasing trend
    seasonal = 20 * np.sin(2 * np.pi * np.arange(365) / 365)  # yearly seasonality
    noise = np.random.normal(0, 10, 365)

    enrollment = base_enrollment + trend + seasonal + noise

    ts_df = pd.DataFrame({
        'date': dates,
        'enrollment': enrollment
    })
    ts_df.set_index('date', inplace=True)

    print("=== TIME SERIES OVERVIEW ===")
    print(f"Date range: {ts_df.index.min()} to {ts_df.index.max()}")
    print(f"Total records: {len(ts_df)}")
    print("\nFirst 5 records:")
    print(ts_df.head())

    print("\n=== RESAMPLING ===")
    # Monthly aggregation ('M' is deprecated in favor of 'ME' in pandas >= 2.2)
    monthly_enrollment = ts_df.resample('M').agg({
        'enrollment': ['mean', 'sum', 'std']
    }).round(2)
    print("Monthly aggregation:")
    print(monthly_enrollment.head())

    # Weekly aggregation
    weekly_enrollment = ts_df.resample('W').mean().round(2)
    print("\nWeekly average enrollment:")
    print(weekly_enrollment.head())

    print("\n=== ROLLING STATISTICS ===")
    # Moving averages
    ts_df['ma_7'] = ts_df['enrollment'].rolling(window=7).mean()
    ts_df['ma_30'] = ts_df['enrollment'].rolling(window=30).mean()

    print("With moving averages:")
    print(ts_df.head(10))

    print("\n=== SEASONAL DECOMPOSITION ===")
    # Basic seasonal pattern detection
    ts_df['month'] = ts_df.index.month
    monthly_pattern = ts_df.groupby('month')['enrollment'].mean()

    print("Average enrollment by month:")
    for month, avg_enrollment in monthly_pattern.items():
        month_name = pd.Timestamp(f'2023-{month:02d}-01').strftime('%B')
        print(f"{month_name}: {avg_enrollment:.1f}")

    return ts_df

# ts_data = time_series_basics()
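
The monthly grouping above is only a rough approximation of seasonality; statsmodels (one of the optional packages) provides a proper trend/seasonal/residual decomposition. A hedged sketch, assuming the daily series returned by time_series_basics() and a weekly period:

def decompose_example():
    """Sketch: seasonal decomposition with statsmodels."""
    try:
        from statsmodels.tsa.seasonal import seasonal_decompose
    except ImportError:
        print("statsmodels not installed. Install with: pip install statsmodels")
        return

    ts_df = time_series_basics()

    # Additive decomposition with a 7-day (weekly) period
    result = seasonal_decompose(ts_df['enrollment'], model='additive', period=7)

    print(result.trend.dropna().head())    # smoothed long-run trend
    print(result.seasonal.head())          # repeating weekly pattern
    print(result.resid.dropna().head())    # what's left over

# decompose_example()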

📈 4. DATA VISUALIZATION

Matplotlib Fundamentals

def matplotlib_examples():
    """Basic chart types with matplotlib."""

    df = create_sample_data()

    # Set up subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Student Data Visualization', fontsize=16, fontweight='bold')

    # 1. Bar Chart - GPA by Department
    dept_gpa = df.groupby('department')['gpa'].mean().sort_values(ascending=False)
    axes[0, 0].bar(dept_gpa.index, dept_gpa.values, color='skyblue', edgecolor='navy')
    axes[0, 0].set_title('Average GPA by Department')
    axes[0, 0].set_ylabel('GPA')
    axes[0, 0].tick_params(axis='x', rotation=45)

    # Add value labels on the bars
    for i, v in enumerate(dept_gpa.values):
        axes[0, 0].text(i, v + 0.1, f'{v:.2f}', ha='center', va='bottom')

    # 2. Histogram - Age Distribution
    axes[0, 1].hist(df['age'], bins=5, color='lightgreen', edgecolor='darkgreen', alpha=0.7)
    axes[0, 1].set_title('Age Distribution')
    axes[0, 1].set_xlabel('Age')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Scatter Plot - GPA vs Credits
    colors = ['red' if gender == 'M' else 'blue' for gender in df['gender']]
    axes[1, 0].scatter(df['credits'], df['gpa'], c=colors, alpha=0.7, s=100)
    axes[1, 0].set_title('GPA vs Credits (Red=Male, Blue=Female)')
    axes[1, 0].set_xlabel('Credits')
    axes[1, 0].set_ylabel('GPA')
    axes[1, 0].grid(True, alpha=0.3)

    # Add the correlation coefficient
    correlation = df['credits'].corr(df['gpa'])
    axes[1, 0].text(0.05, 0.95, f'Correlation: {correlation:.3f}',
                    transform=axes[1, 0].transAxes, fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

    # 4. Pie Chart - Gender Distribution
    gender_counts = df['gender'].value_counts()
    gender_labels = ['Male' if g == 'M' else 'Female' for g in gender_counts.index]
    colors_pie = ['lightblue', 'lightpink']

    wedges, texts, autotexts = axes[1, 1].pie(gender_counts.values, labels=gender_labels,
                                              colors=colors_pie, autopct='%1.1f%%', startangle=90)
    axes[1, 1].set_title('Gender Distribution')

    # Improve text appearance
    for autotext in autotexts:
        autotext.set_color('black')
        autotext.set_fontweight('bold')

    plt.tight_layout()
    plt.show()

# matplotlib_examples()

Seaborn Advanced Visualizations

def seaborn_examples():
    """Advanced visualizations with Seaborn."""

    # Build the larger dataset
    large_df = group_analysis()  # reuse the earlier function

    # Set up the style
    plt.style.use('default')
    sns.set_palette("husl")

    # Create a figure with multiple subplots
    fig = plt.figure(figsize=(20, 15))

    # 1. Distribution Plot
    plt.subplot(3, 3, 1)
    sns.histplot(data=large_df, x='gpa', hue='gender', multiple='dodge', bins=20)
    plt.title('GPA Distribution by Gender')

    # 2. Box Plot
    plt.subplot(3, 3, 2)
    sns.boxplot(data=large_df, x='department', y='gpa')
    plt.title('GPA Distribution by Department')
    plt.xticks(rotation=45)

    # 3. Violin Plot
    plt.subplot(3, 3, 3)
    sns.violinplot(data=large_df, x='semester', y='gpa', hue='gender')
    plt.title('GPA by Semester and Gender')

    # 4. Correlation Heatmap
    plt.subplot(3, 3, 4)
    numeric_cols = ['age', 'gpa', 'credits']
    correlation_matrix = large_df[numeric_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, fmt='.3f')
    plt.title('Correlation Heatmap')

    # 5. Scatter Plot (pair-plot-like view in its own subplot)
    plt.subplot(3, 3, 5)
    sns.scatterplot(data=large_df, x='age', y='gpa', hue='department', s=60)
    plt.title('Age vs GPA by Department')

    # 6. Count Plot
    plt.subplot(3, 3, 6)
    sns.countplot(data=large_df, x='department', hue='gender')
    plt.title('Student Count by Department and Gender')
    plt.xticks(rotation=45)

    # 7. Strip Plot
    plt.subplot(3, 3, 7)
    sns.stripplot(data=large_df, x='department', y='credits',
                  hue='gender', dodge=True, alpha=0.7)
    plt.title('Credits by Department and Gender')
    plt.xticks(rotation=45)

    # 8. Regression Plot
    plt.subplot(3, 3, 8)
    sns.regplot(data=large_df, x='credits', y='gpa', scatter_kws={'alpha': 0.6})
    plt.title('GPA vs Credits (with regression line)')

    # 9. Categorical Plot
    plt.subplot(3, 3, 9)
    sns.barplot(data=large_df, x='department', y='gpa', hue='semester',
                errorbar=('ci', 95), capsize=0.05)  # use ci=95 on seaborn < 0.12
    plt.title('Average GPA by Department and Semester')
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()

    # Separate detailed pair plot
    print("Creating detailed pair plot...")
    pairplot = sns.pairplot(large_df, vars=['age', 'gpa', 'credits'],
                            hue='department', diag_kind='hist',
                            plot_kws={'alpha': 0.6})
    pairplot.fig.suptitle('Pair Plot of Numeric Variables', y=1.02)
    plt.show()

# seaborn_examples()

Interactive Plotting with Plotly

def plotly_examples():
    """Interactive visualizations with Plotly."""

    try:
        import plotly.graph_objects as go
        import plotly.express as px
        from plotly.subplots import make_subplots

        large_df = group_analysis()  # reuse data

        print("Creating interactive Plotly visualizations...")

        # 1. Interactive Scatter Plot
        fig_scatter = px.scatter(
            large_df,
            x='credits',
            y='gpa',
            color='department',
            size='age',
            hover_data=['student_id', 'gender'],
            title='Interactive Scatter: GPA vs Credits by Department',
            labels={'credits': 'Total Credits', 'gpa': 'GPA'}
        )
        fig_scatter.update_layout(height=500)

        # 2. Interactive Bar Chart
        dept_summary = large_df.groupby(['department', 'gender']).agg({
            'gpa': 'mean',
            'student_id': 'count'
        }).round(2).reset_index()
        dept_summary.columns = ['department', 'gender', 'avg_gpa', 'count']

        fig_bar = px.bar(
            dept_summary,
            x='department',
            y='avg_gpa',
            color='gender',
            title='Average GPA by Department and Gender',
            hover_data=['count']
        )

        # 3. Interactive Line Chart (time series simulation)
        # Create time series data
        dates = pd.date_range('2023-01-01', periods=52, freq='W')
        ts_data = []
        for dept in large_df['department'].unique():
            dept_students = len(large_df[large_df['department'] == dept])
            for i, date in enumerate(dates):
                # Simulate weekly enrollment with some trend
                base_count = dept_students / 10
                seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * i / 52)
                weekly_count = base_count * seasonal_factor + np.random.normal(0, 1)

                ts_data.append({
                    'date': date,
                    'department': dept,
                    'weekly_enrollment': max(0, weekly_count)
                })

        ts_df = pd.DataFrame(ts_data)

        fig_line = px.line(
            ts_df,
            x='date',
            y='weekly_enrollment',
            color='department',
            title='Weekly Enrollment Trends by Department',
            labels={'weekly_enrollment': 'Weekly Enrollment'}
        )

        # 4. 3D Scatter Plot
        fig_3d = px.scatter_3d(
            large_df,
            x='age',
            y='credits',
            z='gpa',
            color='department',
            size='gpa',
            hover_data=['student_id'],
            title='3D View: Age, Credits, and GPA'
        )

        # Display instructions
        print("Plotly figures created successfully!")
        print("To display in a Jupyter notebook, use:")
        print("fig_scatter.show()")
        print("fig_bar.show()")
        print("fig_line.show()")
        print("fig_3d.show()")

        # Return the figures for later use
        return {
            'scatter': fig_scatter,
            'bar': fig_bar,
            'line': fig_line,
            '3d': fig_3d
        }

    except ImportError:
        print("Plotly not installed. Install with: pip install plotly")
        return None

# plotly_figs = plotly_examples()
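
Outside a notebook, a figure can also be exported as a self-contained HTML file via fig.write_html; a one-line sketch using the dictionary returned above (the file name is illustrative):

# Export an interactive figure to a standalone HTML file, viewable in any browser
# plotly_figs['scatter'].write_html('gpa_vs_credits.html')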

🔍 5. ADVANCED ANALYSIS TECHNIQUES

Statistical Tests

def statistical_tests():
    """Basic statistical tests."""

    large_df = group_analysis()  # reuse data

    print("=== NORMALITY TESTS ===")
    from scipy.stats import shapiro, normaltest

    # Test whether GPA follows a normal distribution
    gpa_data = large_df['gpa'].dropna()

    # Shapiro-Wilk test
    shapiro_stat, shapiro_p = shapiro(gpa_data)
    print("Shapiro-Wilk Test:")
    print(f" Statistic: {shapiro_stat:.4f}")
    print(f" P-value: {shapiro_p:.4f}")
    print(f" Normal distribution: {'Yes' if shapiro_p > 0.05 else 'No'}")

    # D'Agostino's test
    dagostino_stat, dagostino_p = normaltest(gpa_data)
    print("\nD'Agostino's Test:")
    print(f" Statistic: {dagostino_stat:.4f}")
    print(f" P-value: {dagostino_p:.4f}")

    print("\n=== T-TESTS ===")
    from scipy.stats import ttest_ind, ttest_1samp

    # One-sample t-test: is the average GPA significantly different from 7.0?
    t_stat, t_p = ttest_1samp(gpa_data, 7.0)
    print("One-sample t-test (H0: mean GPA = 7.0):")
    print(f" T-statistic: {t_stat:.4f}")
    print(f" P-value: {t_p:.4f}")
    print(f" Significant difference: {'Yes' if t_p < 0.05 else 'No'}")

    # Two-sample t-test: compare GPA between genders
    male_gpa = large_df[large_df['gender'] == 'M']['gpa'].dropna()
    female_gpa = large_df[large_df['gender'] == 'F']['gpa'].dropna()

    t_stat, t_p = ttest_ind(male_gpa, female_gpa)
    print("\nTwo-sample t-test (Male vs Female GPA):")
    print(f" Male GPA mean: {male_gpa.mean():.3f}")
    print(f" Female GPA mean: {female_gpa.mean():.3f}")
    print(f" T-statistic: {t_stat:.4f}")
    print(f" P-value: {t_p:.4f}")
    print(f" Significant difference: {'Yes' if t_p < 0.05 else 'No'}")

    print("\n=== ANOVA ===")
    from scipy.stats import f_oneway

    # One-way ANOVA: compare GPA across departments
    dept_groups = [group['gpa'].dropna() for name, group in large_df.groupby('department')]
    f_stat, f_p = f_oneway(*dept_groups)

    print("One-way ANOVA (GPA across departments):")
    print(f" F-statistic: {f_stat:.4f}")
    print(f" P-value: {f_p:.4f}")
    print(f" Significant difference: {'Yes' if f_p < 0.05 else 'No'}")

    # Post-hoc inspection if significant
    if f_p < 0.05:
        print("\n Department means:")
        for dept, group in large_df.groupby('department'):
            mean_gpa = group['gpa'].mean()
            print(f" {dept}: {mean_gpa:.3f}")

    print("\n=== CORRELATION TESTS ===")
    from scipy.stats import pearsonr, spearmanr

    # Pearson correlation
    pearson_r, pearson_p = pearsonr(large_df['credits'], large_df['gpa'])
    print("Pearson Correlation (Credits vs GPA):")
    print(f" Correlation coefficient: {pearson_r:.4f}")
    print(f" P-value: {pearson_p:.4f}")
    print(f" Significant correlation: {'Yes' if pearson_p < 0.05 else 'No'}")

    # Spearman correlation (non-parametric)
    spearman_r, spearman_p = spearmanr(large_df['credits'], large_df['gpa'])
    print("\nSpearman Correlation (Credits vs GPA):")
    print(f" Correlation coefficient: {spearman_r:.4f}")
    print(f" P-value: {spearman_p:.4f}")

# statistical_tests()
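
The tests above all compare a numeric variable across groups; for two categorical variables, a chi-square test of independence is the usual tool. A brief sketch (not in the original section), reusing the 100-student dataset:

def chi_square_example():
    """Sketch: chi-square test of independence for two categorical variables."""
    from scipy.stats import chi2_contingency

    df = group_analysis()  # reuse the simulated data

    # Contingency table: gender counts within each department
    contingency = pd.crosstab(df['department'], df['gender'])
    print(contingency)

    chi2, p, dof, expected = chi2_contingency(contingency)
    print(f"Chi-square: {chi2:.4f}, p-value: {p:.4f}, dof: {dof}")
    print(f"Variables dependent: {'Yes' if p < 0.05 else 'No'}")

# chi_square_example()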

Linear Regression Analysis

def regression_analysis():
    """Linear regression analysis."""

    try:
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import r2_score, mean_squared_error
        import statsmodels.api as sm

        large_df = group_analysis()  # reuse data

        print("=== SIMPLE LINEAR REGRESSION ===")
        # Predict GPA from credits
        X_simple = large_df[['credits']].values
        y = large_df['gpa'].values

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X_simple, y, test_size=0.2, random_state=42
        )

        # Fit model
        model_simple = LinearRegression()
        model_simple.fit(X_train, y_train)

        # Predictions
        y_pred = model_simple.predict(X_test)

        # Metrics
        r2 = r2_score(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)

        print("Simple Regression (GPA ~ Credits):")
        print(f" Coefficient: {model_simple.coef_[0]:.4f}")
        print(f" Intercept: {model_simple.intercept_:.4f}")
        print(f" R-squared: {r2:.4f}")
        print(f" MSE: {mse:.4f}")
        print(f" Equation: GPA = {model_simple.intercept_:.3f} + {model_simple.coef_[0]:.3f} * Credits")

        print("\n=== MULTIPLE LINEAR REGRESSION ===")
        # Encode categorical variables
        df_encoded = large_df.copy()
        df_encoded = pd.get_dummies(df_encoded, columns=['department', 'gender', 'semester'])

        # Select features
        feature_cols = ['age', 'credits'] + [col for col in df_encoded.columns
                                             if col.startswith(('department_', 'gender_'))]
        X_multiple = df_encoded[feature_cols].values

        # Split and fit
        X_train_mult, X_test_mult, y_train_mult, y_test_mult = train_test_split(
            X_multiple, y, test_size=0.2, random_state=42
        )

        model_multiple = LinearRegression()
        model_multiple.fit(X_train_mult, y_train_mult)

        # Predictions and metrics
        y_pred_mult = model_multiple.predict(X_test_mult)
        r2_mult = r2_score(y_test_mult, y_pred_mult)
        mse_mult = mean_squared_error(y_test_mult, y_pred_mult)

        print("Multiple Regression:")
        print(f" R-squared: {r2_mult:.4f}")
        print(f" MSE: {mse_mult:.4f}")
        print(f" Improvement over simple: {((r2_mult - r2) / r2 * 100):.1f}%")

        # Feature coefficients
        print("\n Feature Coefficients:")
        for i, feature in enumerate(feature_cols):
            print(f" {feature}: {model_multiple.coef_[i]:.4f}")

        print("\n=== STATISTICAL SIGNIFICANCE (using statsmodels) ===")
        # Add a constant for the intercept; cast to float so statsmodels
        # accepts the boolean dummy columns produced by get_dummies
        X_with_const = sm.add_constant(X_multiple.astype(float))

        # Fit the OLS model
        ols_model = sm.OLS(y, X_with_const).fit()

        print("OLS Regression Results Summary:")
        print(f" R-squared: {ols_model.rsquared:.4f}")
        print(f" Adjusted R-squared: {ols_model.rsquared_adj:.4f}")
        print(f" F-statistic: {ols_model.fvalue:.4f}")
        print(f" F-statistic p-value: {ols_model.f_pvalue:.6f}")

        # Coefficient significance
        print("\n Significant coefficients (p < 0.05):")
        for i, (feature, p_val) in enumerate(zip(['const'] + feature_cols, ols_model.pvalues)):
            if p_val < 0.05:
                coef = ols_model.params[i]
                print(f" {feature}: {coef:.4f} (p = {p_val:.4f})")

        return {
            'simple_model': model_simple,
            'multiple_model': model_multiple,
            'ols_model': ols_model
        }

    except ImportError:
        print("scikit-learn or statsmodels not installed.")
        print("Install with: pip install scikit-learn statsmodels")
        return None

# regression_models = regression_analysis()

📊 6. REAL-WORLD PROJECT EXAMPLE

Complete Data Analysis Pipeline

def complete_data_analysis_project():
    """A complete project from A to Z."""

    print("🎯 PROJECT: STUDENT PERFORMANCE ANALYSIS")
    print("=" * 50)

    # 1. DATA GENERATION & LOADING
    print("\n📊 STEP 1: LOADING DATA")
    np.random.seed(42)

    # Simulate realistic student data
    n_students = 500
    departments = ['Computer Science', 'Business', 'Engineering', 'Mathematics', 'Physics']
    cities = ['Hà Nội', 'TP.HCM', 'Đà Nẵng', 'Cần Thơ', 'Hải Phòng']

    student_data = []
    for i in range(n_students):
        dept = np.random.choice(departments)

        # Department influences the base GPA
        dept_gpa_base = {
            'Computer Science': 7.8,
            'Business': 7.2,
            'Engineering': 7.5,
            'Mathematics': 8.0,
            'Physics': 7.7
        }

        base_gpa = dept_gpa_base[dept]
        gpa = np.random.normal(base_gpa, 1.2)
        gpa = max(0, min(10, gpa))  # clamp to the valid range

        student = {
            'student_id': f'SV{i+1:04d}',
            'name': f'Student {i+1}',
            'department': dept,
            'gender': np.random.choice(['M', 'F']),
            'age': np.random.randint(18, 25),
            'city': np.random.choice(cities),
            'gpa': gpa,
            'credits_completed': np.random.randint(30, 120),
            'study_hours_per_week': np.random.normal(25, 8),
            'family_income': np.random.choice(['Low', 'Medium', 'High'], p=[0.3, 0.5, 0.2]),
            'has_scholarship': np.random.choice([True, False], p=[0.2, 0.8]),
            'extracurricular_activities': np.random.randint(0, 5)
        }

        # Ensure positive study hours
        student['study_hours_per_week'] = max(5, student['study_hours_per_week'])

        student_data.append(student)

    df = pd.DataFrame(student_data)
    print(f"✅ Loaded {len(df)} student records")
    print(f"📋 Columns: {list(df.columns)}")

    # 2. DATA EXPLORATION
    print("\n🔍 STEP 2: DATA EXPLORATION")
    print(f"Dataset shape: {df.shape}")
    print(f"Missing values: {df.isnull().sum().sum()}")

    # Basic statistics
    print("\n📊 Key Statistics:")
    print(f" Average GPA: {df['gpa'].mean():.2f}")
    print(f" GPA Range: {df['gpa'].min():.2f} - {df['gpa'].max():.2f}")
    print(f" Average Study Hours: {df['study_hours_per_week'].mean():.1f}")
    print(f" Scholarship Recipients: {df['has_scholarship'].sum()} ({df['has_scholarship'].mean()*100:.1f}%)")

    # Department distribution
    print("\n🏫 Students by Department:")
    dept_counts = df['department'].value_counts()
    for dept, count in dept_counts.items():
        print(f" {dept}: {count} ({count/len(df)*100:.1f}%)")

    # 3. DATA CLEANING
    print("\n🧹 STEP 3: DATA CLEANING")
    # The simulated data is already clean, but this is what the process looks like

    # Check for GPA outliers with the 1.5 * IQR rule
    Q1 = df['gpa'].quantile(0.25)
    Q3 = df['gpa'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df['gpa'] < lower_bound) | (df['gpa'] > upper_bound)]
    print(f"📈 GPA Outliers detected: {len(outliers)}")

    if len(outliers) > 0:
        print("Outlier GPAs:", outliers['gpa'].tolist())

    # 4. ANALYSIS
    print("\n📊 STEP 4: DETAILED ANALYSIS")

    # 4.1 Department Analysis
    print("\n🏫 Department Performance Analysis:")
    dept_analysis = df.groupby('department').agg({
        'gpa': ['mean', 'std', 'min', 'max'],
        'study_hours_per_week': 'mean',
        'has_scholarship': lambda x: x.sum() / len(x),
        'student_id': 'count'
    }).round(3)

    print(dept_analysis)

    # 4.2 Gender Analysis
    print("\n👥 Gender Performance Analysis:")
    gender_analysis = df.groupby(['department', 'gender']).agg({
        'gpa': 'mean',
        'study_hours_per_week': 'mean'
    }).round(3)
    print(gender_analysis)

    # 4.3 Correlation Analysis
    print("\n🔗 Correlation Analysis:")
    numeric_cols = ['age', 'gpa', 'credits_completed', 'study_hours_per_week', 'extracurricular_activities']
    correlations = df[numeric_cols].corr()['gpa'].sort_values(ascending=False)

    print("Factors most correlated with GPA:")
    for factor, corr in correlations.items():
        if factor != 'gpa':
            print(f" {factor}: {corr:.3f}")

    # 5. STATISTICAL TESTS
    print("\n📈 STEP 5: STATISTICAL TESTING")

    # Test whether scholarship students have a higher GPA
    scholarship_gpa = df[df['has_scholarship']]['gpa']
    no_scholarship_gpa = df[~df['has_scholarship']]['gpa']

    from scipy.stats import ttest_ind
    t_stat, p_value = ttest_ind(scholarship_gpa, no_scholarship_gpa)

    print("🎓 Scholarship Impact on GPA:")
    print(f" With scholarship: {scholarship_gpa.mean():.3f} ± {scholarship_gpa.std():.3f}")
    print(f" Without scholarship: {no_scholarship_gpa.mean():.3f} ± {no_scholarship_gpa.std():.3f}")
    print(f" T-test p-value: {p_value:.6f}")
    print(f" Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

    # 6. PREDICTIVE MODELING
    print("\n🤖 STEP 6: PREDICTIVE MODELING")

    try:
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import r2_score, mean_absolute_error

        # Prepare features
        df_model = df.copy()
        df_model = pd.get_dummies(df_model, columns=['department', 'gender', 'city', 'family_income'])

        # Select features
        feature_cols = [col for col in df_model.columns if col not in
                        ['student_id', 'name', 'gpa']]

        X = df_model[feature_cols]
        y = df_model['gpa']

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Train model
        rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)

        # Evaluate
        y_pred = rf_model.predict(X_test)
        r2 = r2_score(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)

        print("🎯 Random Forest Model Performance:")
        print(f" R-squared: {r2:.4f}")
        print(f" Mean Absolute Error: {mae:.4f}")

        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': feature_cols,
            'importance': rf_model.feature_importances_
        }).sort_values('importance', ascending=False)

        print("\n🔝 Top 5 Most Important Features:")
        for _, row in feature_importance.head().iterrows():
            print(f" {row['feature']}: {row['importance']:.4f}")

    except ImportError:
        print("scikit-learn not available for modeling")

    # 7. INSIGHTS & RECOMMENDATIONS
    print("\n💡 STEP 7: KEY INSIGHTS & RECOMMENDATIONS")
    print("=" * 50)

    insights = [
        f"📚 Students spend an average of {df['study_hours_per_week'].mean():.1f} hours/week studying",
        f"🏆 {dept_counts.index[0]} has the most students ({dept_counts.iloc[0]})",
        f"🎓 Scholarship students have {'higher' if scholarship_gpa.mean() > no_scholarship_gpa.mean() else 'similar'} GPA",
        f"📊 Study hours correlation with GPA: {df['study_hours_per_week'].corr(df['gpa']):.3f}",
        f"🎯 {len(df[df['gpa'] >= 8.0])} students ({len(df[df['gpa'] >= 8.0])/len(df)*100:.1f}%) achieve GPA ≥ 8.0"
    ]

    for i, insight in enumerate(insights, 1):
        print(f"{i}. {insight}")

    print("\n🎯 RECOMMENDATIONS:")
    recommendations = [
        "Increase study support for students with <20 hours/week",
        "Expand scholarship programs (positive correlation with performance)",
        "Focus on departments with lower average GPAs",
        "Encourage balanced extracurricular participation",
        "Implement an early warning system for at-risk students"
    ]

    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec}")

    return df

# Run the complete analysis
# final_df = complete_data_analysis_project()


🎯 Summary

📊 Essential Libraries:

  • pandas - Data manipulation and analysis
  • NumPy - Numerical computing
  • matplotlib - Static plotting
  • seaborn - Statistical visualization
  • scipy - Scientific computing
  • plotly - Interactive visualization

🔧 Core Skills:

  • Data Loading - CSV, Excel, JSON, databases
  • Data Cleaning - Handle missing values, outliers, duplicates
  • Data Exploration - Descriptive statistics, distributions
  • Data Visualization - Charts, plots, dashboards
  • Statistical Analysis - Hypothesis testing, correlations
  • Modeling - Linear regression, predictions

📈 Analysis Workflow:

  1. Load & Explore - Understand your data
  2. Clean & Prepare - Handle quality issues
  3. Analyze & Visualize - Find patterns
  4. Test Hypotheses - Statistical validation
  5. Model & Predict - Build predictive models
  6. Communicate Results - Insights & recommendations

💡 Best Practices:

  • Start Simple - Basic stats before advanced modeling
  • Visualize Early - Plots reveal insights quickly
  • Validate Assumptions - Test statistical requirements
  • Document Process - Reproducible analysis
  • Question Results - Does it make business sense?

🎯 Common Pitfalls:

  • Data Leakage - Using future information (see the sketch after this list)
  • Overfitting - Model too complex for data
  • Correlation ≠ Causation - Don't assume cause
  • Selection Bias - Non-representative samples
  • Missing Context - Numbers without business understanding
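
To make the data-leakage pitfall concrete, here is a minimal sketch (a hypothetical scaling example, not taken from the analyses above): fit any preprocessing on the training split only, then apply it to the test split.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and target (stand-ins for real data)
X = np.random.rand(100, 3)
y = np.random.rand(100)

# WRONG: fitting the scaler on all rows leaks test-set statistics into training
# scaler = StandardScaler().fit(X)

# RIGHT: split first, then fit preprocessing on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only - never refit on test data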

📝 Last updated: September 2024
💡 Tip: "In God we trust, all others must bring data" - W. Edwards Deming
