Some notes, mostly focused on ML
Notes for the Amazon SageMaker Enablement Workshop
Technical trainer: Cathy Lai
Basic ML flow
This diagram basically explains the whole flow of how ML solves a problem.
Machine Learning Mechanism
ML types
- Unsupervised: clustering (e.g. recommendation systems), dimensionality reduction
- Supervised: classification (e.g. whether to buy or sell a stock), regression (e.g. predicting a stock price)
- Reinforcement
This class focused mainly on supervised learning. The goal was to use ML to identify telecom customers who are likely to churn.
ML tips
- Data visualization: check the input data before model training
Statistical analysis:
• Usually done with libraries like pandas, NumPy or Matplotlib
• Use pandas functions like hist(), describe(), crosstab() and select_dtypes()
• In an IPython/Jupyter notebook, run the %matplotlib inline magic so plots such as hist() render inline
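To make the functions listed above concrete, here is a minimal sketch of a pre-training data check. The DataFrame and its column names (intl_plan, day_mins, cust_serv_calls, churn) are made up for illustration and are not the workshop dataset.

```python
import pandas as pd

# Hypothetical toy churn data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "intl_plan": ["yes", "no", "no", "yes", "no", "no"],
    "day_mins": [265.1, 161.6, 243.4, 299.4, 166.7, 223.4],
    "cust_serv_calls": [1, 1, 0, 2, 3, 0],
    "churn": ["yes", "no", "no", "yes", "yes", "no"],
})

# describe(): count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()

# select_dtypes(): split the columns into numeric and non-numeric features.
numeric = df.select_dtypes(include="number")
categorical = df.select_dtypes(exclude="number")

# crosstab(): frequency table of a categorical feature against the target.
plan_vs_churn = pd.crosstab(df["intl_plan"], df["churn"])

print(summary)
print(plan_vs_churn)

# In a notebook, after running %matplotlib inline, numeric.hist() would draw
# one histogram per numeric column (this needs Matplotlib installed).
```

Looking at the crosstab before training is a quick way to spot features that already separate churners from non-churners, or categories with too few rows to be useful.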
- Data preparation
- Data cleansing: this was the most valuable part for me, since I had never done anything like it before
• Drop unused columns before model training
• Use the pandas functions corr() and scatter_matrix() to inspect how features relate to each other
• Confirm which algorithm to use after data cleansing
• Important: remove any column that is 100% correlated with another column
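The cleansing steps above could be sketched roughly as follows. This is only an illustration: the data, the column names, the helper redundant_columns() and the 0.999 threshold are all assumptions of mine, not something shown in the workshop.

```python
import pandas as pd

# Hypothetical data: day_charge is exactly 0.17 * day_mins, so it carries
# no extra information. phone is an identifier that should not be a feature.
df = pd.DataFrame({
    "phone": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
    "day_mins": [100.0, 200.0, 150.0, 300.0, 250.0],
    "day_charge": [17.0, 34.0, 25.5, 51.0, 42.5],
    "eve_mins": [210.0, 180.0, 240.0, 190.0, 205.0],
})

# Step 1: drop unused columns before training.
features = df.drop(columns=["phone"])

# Step 2: absolute pairwise Pearson correlation between the numeric columns.
corr = features.corr().abs()


def redundant_columns(corr, threshold=0.999):
    """Return the later column of every pair correlated at or above threshold."""
    to_drop = []
    cols = list(corr.columns)
    for i, first in enumerate(cols):
        if first in to_drop:
            continue
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] >= threshold:
                to_drop.append(second)
    return to_drop


# Step 3: remove the (almost) perfectly correlated columns.
dropped = redundant_columns(corr)
cleaned = features.drop(columns=dropped)

print(dropped)
print(list(cleaned.columns))

# pandas.plotting.scatter_matrix(features) would show the same relationships
# as a grid of scatter plots (this needs Matplotlib installed).
```

A threshold slightly below 1.0 is used because floating-point rounding can keep an exactly linear pair from reaching a correlation of precisely 1.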
Which algorithm? (BYO = bring your own)
Classification:
• Linear Learner • XGBoost • Factorization Machines • SVMs (Spark, BYO)
Regression:
• Linear Learner • XGBoost • Factorization Machines • SVMs (Spark, BYO)
Recommendations:
• Factorization Machines • Collaborative Filtering (Spark) • Matrix Factorization (BYO)
Forecasting:
• DeepAR • Linear Learner • XGBoost • Prophet (BYO) • ARIMA (BYO) • EST (BYO)
Clustering:
• K-Means • Hotspot Detection (Kinesis Analytics) • DBScan (BYO) • GMMs (BYO)
Dimensionality Reduction/Anomaly Detection:
• PCA • Random Cut Forest (Kinesis Analytics) • t-SNE (BYO) • Manifold Learning (BYO) • Autoencoders (BYO) • SVMs (Spark, BYO)
Hyperparameter tuning
Unless you are very experienced, it seems the only option is brute force: try the combinations one by one.
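A brute-force search like this is usually called a grid search. Below is a minimal sketch using only the standard library. The validation_error() function is a hypothetical stand-in for "train a model and measure its validation error", and the parameter names max_depth and eta and their candidate values are illustrative assumptions.

```python
from itertools import product


def validation_error(max_depth, eta):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return (max_depth - 5) ** 2 + (eta - 0.2) ** 2


# Candidate values to try for each hyperparameter (illustrative only).
grid = {
    "max_depth": [3, 5, 7],
    "eta": [0.1, 0.2, 0.3],
}


def grid_search(objective, grid):
    """Evaluate every combination in the grid and keep the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(validation_error, grid)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (3 x 3 = 9 training runs here), which is why brute force gets expensive quickly as more hyperparameters are added.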