Practical Guide | Hands-On Analysis Case Study: User Behavior Prediction
2025-11-26 12:18
data.head()
data
Dask DataFrame Structure :
Dask Name: read-csv, 58 tasks
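Unlike pandas, what we get here is only the structure of the DataFrame, not the actual data. Dask has split the DataFrame into chunks for loading; these chunks live on disk rather than in RAM. To display the DataFrame, Dask first has to load every chunk into RAM, stitch the chunks together, and then show the final DataFrame. Call .compute() to force this; otherwise nothing is computed. Dask uses lazy loading, similar to Python's iterators: the data is actually loaded only when it is needed.
# Actually load the data
data.compute()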
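The lazy-loading idea can be illustrated without Dask at all. The sketch below is a plain-Python analogy built on a generator; `read_chunks` and the chunk contents are made up for illustration and are not part of the Dask API.

```python
# A plain-Python analogy for lazy loading (not Dask itself).
# read_chunks is a hypothetical helper; the "chunks" are small in-memory lists.
loaded = []  # records which chunks have actually been read


def read_chunks(n_chunks, rows_per_chunk):
    """Yield one chunk at a time; nothing is read until the caller iterates."""
    for i in range(n_chunks):
        loaded.append(i)
        start = i * rows_per_chunk
        yield list(range(start, start + rows_per_chunk))


lazy = read_chunks(n_chunks=3, rows_per_chunk=4)  # builds the plan only
chunks_before = len(loaded)                       # no chunk has been read yet

# Consuming the generator plays the role of .compute(): every chunk is
# read and the pieces are stitched into one result.
full = [row for chunk in lazy for row in chunk]
chunks_after = len(loaded)
print(chunks_before, chunks_after, len(full))
```

Creating the generator costs almost nothing; the work happens only when the result is consumed, which is the behavior described above for Dask.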
# Visualize the task graph: 58 partition tasks
data.visualize()
Data preprocessing
Data compression
# Check the current data types
data.dtypes
U_Id int64
T_Id int64
C_Id int64
Be_type object
Ts int64
dtype: object
# Compress to 32-bit uint (unsigned integer), since the transaction data contains no negative numbers
dtypes = {
'U_Id': 'uint32',
'T_Id': 'uint32',
'C_Id': 'uint32',
'Be_type': 'object',
'Ts': 'int64'
}
data= data.astype(dtypes)
data.dtypes
U_Id uint32
T_Id uint32
C_Id uint32
Be_type object
Ts int64
dtype: object
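A quick way to see what the cast saves is to measure a column before and after. The sketch below uses a synthetic pandas Series as a stand-in for U_Id (the real data is not reproduced here) and also checks that the values fit into uint32 before casting.

```python
import pandas as pd

# Synthetic IDs standing in for U_Id; the real column is not reproduced here.
ids = pd.Series(range(100_000), dtype="int64")

bytes_int64 = ids.memory_usage(index=False)
bytes_uint32 = ids.astype("uint32").memory_usage(index=False)
print(bytes_int64, bytes_uint32)

# uint32 holds values from 0 to 4_294_967_295, so check the range before casting.
UINT32_MAX = 2**32 - 1
safe_to_cast = bool(ids.min() >= 0 and ids.max() <= UINT32_MAX)
print(safe_to_cast)
```

Halving the width of three ID columns matters at this scale: the behavior counts later in the article add up to roughly 100 million rows.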
Missing values
# For data read with dask, pandas functions such as .isnull() cannot be used directly to screen for missing values
data.isnull()
Dask DataFrame Structure :
columns1 = ['U_Id', 'T_Id', 'C_Id', 'Be_type', 'Ts']
tmpDf1 = pd.DataFrame(columns=columns1)
tmpDf1
s = data["U_Id"].isna()
s.loc[s == True]
Dask Series Structure:
npartitions=58
    bool
     ...
     ...
     ...
Name: U_Id, dtype: bool
Dask Name: loc-series, 348 tasks
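Total missing values in column U_Id: 0
Total missing values in column T_Id: 0
Total missing values in column C_Id: 0
Total missing values in column Be_type: 0
Total missing values in column Ts: 0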
No missing values.
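The per-column messages above come from a loop that is not shown in the text. A minimal sketch of one way to produce such counts follows, using a small pandas frame with made-up rows; with a Dask DataFrame the same `isna().sum()` expression would additionally need `.compute()`.

```python
import pandas as pd

# Made-up rows with the same column names as the article's data.
sample = pd.DataFrame(
    {
        "U_Id": [1, 1, 100],
        "T_Id": [2268318, 2333346, 1603476],
        "C_Id": [2520377, 2520771, 2951233],
        "Be_type": ["pv", "pv", "buy"],
        "Ts": [1511544070, 1511561733, 1511580830],
    }
)

# Count missing values column by column.
missing_counts = {col: int(sample[col].isna().sum()) for col in sample.columns}
for col, n_missing in missing_counts.items():
    print(f"Total missing values in column {col}: {n_missing}")
```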
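Data exploration and visualization
Here we use the pyecharts library. pyecharts is a data visualization tool that combines Python with ECharts, the open-source charting library originally developed by Baidu. The code conventions of the newer 1.X releases differ considerably from those of the older 0.5.X releases; for the new version, see the official documentation #/README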
# pip install pyecharts -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pyecharts in d:\anaconda\lib\site-packages (0.1.9.4)
Requirement already satisfied: jinja2 in d:\anaconda\lib\site-packages (from pyecharts) (3.0.2)
Requirement already satisfied: future in d:\anaconda\lib\site-packages (from pyecharts) (0.18.2)
Requirement already satisfied: pillow in d:\anaconda\lib\site-packages (from pyecharts) (8.3.2)
Requirement already satisfied: MarkupSafe>=2.0 in d:\anaconda\lib\site-packages (from jinja2->pyecharts) (2.0.1)
Note: you may need to restart the kernel to use updated packages.
WARNING: Ignoring invalid distribution -umpy (d:\anaconda\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (d:\anaconda\lib\site-packages)
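Pie chart
# For example, we want to draw a nice pie chart showing the share of each user behavior
data["Be_type"]
# With dask, every supported pandas function needs .compute() appended before it actually executes
Be_counts = data["Be_type"].value_counts().compute()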
Be_counts
pv 89716264
cart 5530446
fav 2888258
buy 2015839
Name: Be_type, dtype: int64
Be_index = Be_counts.index.tolist()  # extract the labels
Be_index
['pv', 'cart', 'fav', 'buy']
Be_values = Be_counts.values.tolist()  # extract the values
Be_values
[89716264, 5530446, 2888258, 2015839]
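Before drawing the chart, it can help to see the shares as plain numbers. The sketch below reuses the counts printed above and pairs labels with values using `zip`; the variable names `pairs` and `shares` are illustrative.

```python
# Counts copied from the value_counts output above.
Be_index = ["pv", "cart", "fav", "buy"]
Be_values = [89716264, 5530446, 2888258, 2015839]

# zip yields (label, value) tuples; list(z) turns each into a two-item list.
pairs = [list(z) for z in zip(Be_index, Be_values)]

total = sum(Be_values)
shares = {label: 100 * value / total for label, value in pairs}
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

Page views make up close to nine tenths of all recorded events, so the other three behaviors share the small remainder of the pie.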
from pyecharts import options as opts
from pyecharts.charts import Pie
# Pie expects its data as a list of (label, value) pairs
c = Pie()
c.add("", [list(z) for z in zip(Be_index, Be_values)])  # zip packs the iterables into tuples, one pair per label
c.set_global_opts(title_opts=opts.TitleOpts(title="User behavior"))  # global option (chart title)
c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
c.render_notebook()  # render into the current notebook environment
# c.render("pie_base.html")  # optionally write the chart to a local HTML file
<pyecharts.charts.basic_charts.pie.Pie at 0x1b2da75ae48>
Funnel chart
from pyecharts.charts import Funnel  # older pyecharts versions could import this without .charts
import pyecharts.options as opts
from IPython.display import Image as IMG
from pyecharts import options as opts
from pyecharts.charts import Pie
<pyecharts.charts.basic_charts.funnel.Funnel at 0x1b2939d50c8>
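The code that builds the funnel is not shown in the text. As a sketch of the numbers such a chart would convey, the snippet below computes each behavior's count relative to page views, using the counts from the pie chart section. Treating pv, cart, fav and buy as ordered funnel stages is an assumption made here for illustration; in practice a user can buy without favoriting or adding to cart.

```python
# Counts copied from the value_counts output earlier in the article.
stage_counts = {"pv": 89716264, "cart": 5530446, "fav": 2888258, "buy": 2015839}

base = stage_counts["pv"]
# Each stage as a percentage of page views.
rate_vs_pv = {stage: 100 * count / base for stage, count in stage_counts.items()}

for stage, pct in rate_vs_pv.items():
    print(f"{stage}: {pct:.2f}% of page views")

# Data pairs in the same [label, value] shape the pie example passes to c.add.
funnel_pairs = [[stage, round(pct, 2)] for stage, pct in rate_vs_pv.items()]
```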
Data analysis
Timestamp processing
Dask's support for timestamps is rather unfriendly.
type(data)
dask.dataframe.core.DataFrame
import time
data['Ts1'] = data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x)))
data['Ts2'] = data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d", time.localtime(x)))
data['Ts3'] = data['Ts'].apply(lambda x: time.strftime("%H:%M:%S", time.localtime(x)))
D:\anaconda\lib\site-packages\dask\dataframe\core.py:3701: UserWarning:
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('Ts', 'object'))
  warnings.warn(meta_warning(meta))
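For reference, here is what the three derived columns contain for a single timestamp. This sketch uses only the standard library and passes the time zone explicitly, because `time.localtime` depends on the settings of the machine running the notebook. The UTC+8 offset and the helper name `split_timestamp` are assumptions, not taken from the article.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log was recorded in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def split_timestamp(ts, tz=CST):
    """Return (datetime string, date string, time string) for a Unix timestamp."""
    dt = datetime.fromtimestamp(ts, tz=tz)
    return (
        dt.strftime("%Y-%m-%d %H:%M:%S"),  # corresponds to Ts1
        dt.strftime("%Y-%m-%d"),           # corresponds to Ts2
        dt.strftime("%H:%M:%S"),           # corresponds to Ts3
    )


ts1, ts2, ts3 = split_timestamp(1511568000)
print(ts1, ts2, ts3)
```

All three results are strings, which matches the `object` dtype that the warning suggests declaring through `meta`.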
data.head(1)
data.dtypes
U_Id uint32
T_Id uint32
C_Id uint32
Be_type object
Ts int64
Ts1 object
Ts2 object
Ts3 object
dtype: object
Take a subset of the data to debug the code
df = data.head(1000000)
df.head(1)
User traffic and purchase time analysis
User behavior summary table
import numpy as np
import pandas as pd
describe = df.loc[:, ["U_Id", "Be_type"]]
# One row per distinct user; the zero column is a placeholder that is dropped after the joins
user_ids = df["U_Id"].unique()
ids = pd.DataFrame(np.zeros(len(user_ids)), index=user_ids)
pv_class = describe[describe["Be_type"] == "pv"].groupby("U_Id").count()
pv_class.columns = ["pv"]
buy_class = describe[describe["Be_type"] == "buy"].groupby("U_Id").count()
buy_class.columns = ["buy"]
fav_class = describe[describe["Be_type"] == "fav"].groupby("U_Id").count()
fav_class.columns = ["fav"]
cart_class = describe[describe["Be_type"] == "cart"].groupby("U_Id").count()
cart_class.columns = ["cart"]
user_behavior_counts = ids.join(pv_class).join(fav_class).join(cart_class).join(buy_class).iloc[:, 1:]
user_behavior_counts.head()
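The same per-user table can be produced in one step. The sketch below uses `pd.crosstab` on a few made-up rows; it is an alternative to the chain of joins, and it yields 0 rather than NaN for behaviors a user never performed.

```python
import pandas as pd

# Made-up events with the article's column names.
events = pd.DataFrame(
    {
        "U_Id": [1, 1, 1, 100, 100, 117],
        "Be_type": ["pv", "pv", "cart", "pv", "fav", "buy"],
    }
)

# Rows are users, columns are behavior types, cells are event counts.
counts = pd.crosstab(events["U_Id"], events["Be_type"])

# Fix the column order; reindex fills any absent behavior type with 0.
counts = counts.reindex(columns=["pv", "fav", "cart", "buy"], fill_value=0)
print(counts)
```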
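Total visits and transaction volume over time (by day)
import matplotlib.pyplot as plt
from matplotlib import font_manager
# Fix the minus sign on axis ticks being rendered as a square
plt.rcParams['axes.unicode_minus'] = False
# Fix the display of Chinese characters
plt.rcParams['font.sans-serif'] = ['Simhei']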
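The daily trend of total visits and transaction volume shows only small fluctuations from November 25 to December 1, 2017. On December 2, 2017, both visits and transactions rose sharply, and both stayed high on the 2nd and 3rd. One reason is that December 2 and 3 fall on a weekend; there may also have been promotional activities on those two days, which should be checked against the actual business situation. (In the chart, visits rise on Friday while transactions fall; presumably the weekend activity led users to postpone Friday purchases.)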
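The weekday claims can be checked with the standard library. The sketch below lists the day of the week for each date in the observation window. Note that November 25 and 26 also fall on a weekend, so the weekend effect alone may not explain the jump on December 2.

```python
from datetime import date, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

start, end = date(2017, 11, 25), date(2017, 12, 3)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# date.weekday() returns 0 for Monday through 6 for Sunday.
weekdays = {d.isoformat(): DAY_NAMES[d.weekday()] for d in days}
for day, name in weekdays.items():
    print(day, name)

weekend_days = [day for day, name in weekdays.items() if name in ("Sat", "Sun")]
```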
Total visits and transaction volume over time (by hour)
# Data preparation
df_pv_timestamp = df[df["Be_type"] == "pv"][["Be_type", "Ts1"]].copy()
df_pv_timestamp["Ts1"] = pd.to_datetime(df_pv_timestamp["Ts1"])
df_pv_timestamp = df_pv_timestamp.set_index("Ts1")
df_pv_timestamp = df_pv_timestamp.resample("H").count()["Be_type"]
df_pv_timestamp
df_buy_timestamp = df[df["Be_type"] == "buy"][["Be_type", "Ts1"]].copy()
df_buy_timestamp["Ts1"] = pd.to_datetime(df_buy_timestamp["Ts1"])
df_buy_timestamp = df_buy_timestamp.set_index("Ts1")
df_buy_timestamp = df_buy_timestamp.resample("H").count()["Be_type"]
df_buy_timestamp
Ts1
2017-09-11 16:00:00        1
2017-09-11 17:00:00        0
2017-09-11 18:00:00        0
2017-09-11 19:00:00        0
2017-09-11 20:00:00        0
                       ...
2017-12-03 20:00:00     8587
2017-12-03 21:00:00    10413
2017-12-03 22:00:00     9862
2017-12-03 23:00:00     7226
2017-12-04 00:00:00        1
Freq: H, Name: Be_type, Length: 2001, dtype: int64
Ts1
2017-11-25 00:00:00     64
2017-11-25 01:00:00     29
2017-11-25 02:00:00     18
2017-11-25 03:00:00      8
2017-11-25 04:00:00      3
                      ...
2017-12-03 19:00:00    141
2017-12-03 20:00:00    159
2017-12-03 21:00:00    154
2017-12-03 22:00:00    154
2017-12-03 23:00:00    123
Freq: H, Name: Be_type, Length: 216, dtype: int64
# Plot
plt.figure(figsize=(20, 6), dpi=70)
x2 = df_buy_timestamp.index
plt.plot(range(len(x2)), df_buy_timestamp.values, label="Transactions", color="blue", linewidth=2)
plt.title("Total transaction volume over time, line chart (by hour)")
x2 = [i.strftime("%Y-%m-%d %H:%M") for i in x2]
plt.xticks(range(len(x2))[::4], x2[::4], rotation=90)
plt.xlabel("Ts2")
plt.ylabel("Ts3")
plt.grid(alpha=0.4);
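Feature engineering
Approach: ignore the time window and predict whether a user purchases based only on browsing, favoriting, and similar behaviors.
Steps: group by user ID (U_Id) and tally each user's browsing, favoriting, and add-to-cart behaviors, namely:
whether the user browsed, and how many times; whether the user favorited, and how many times; whether the user added to cart, and how many times.
These features are then used to predict whether the user ultimately purchases.
# Drop the timestamp columns
df = df[["U_Id", "T_Id", "C_Id", "Be_type"]]
df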
Behavior types
U_Id
1      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
100    [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 3, ...
115    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
117    [4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, ...
118    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Name: Be_type1, dtype: object
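The step that produces this per-user Series is not shown in the text. One plausible reconstruction is sketched below: map each behavior to a code, then collect the codes per user. The mapping pv to '1', cart to '2', fav to '3', buy to '4' is inferred from the lookups used in the following sections, and string codes are used because those lookups test for '2', '3' and '4' as strings. Treat both as assumptions, along with the names `df_demo` and `df_Be_demo`.

```python
import pandas as pd

# Made-up events with the article's column names.
df_demo = pd.DataFrame(
    {
        "U_Id": [1, 1, 100, 100, 117, 117],
        "Be_type": ["pv", "pv", "pv", "fav", "buy", "pv"],
    }
)

# Assumed encoding, inferred from the later feature code.
code_map = {"pv": "1", "cart": "2", "fav": "3", "buy": "4"}
df_demo["Be_type1"] = df_demo["Be_type"].map(code_map)

# One list of behavior codes per user, indexed by U_Id.
df_Be_demo = df_demo.groupby("U_Id")["Be_type1"].apply(list)
print(df_Be_demo)
```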
Finally, create a DataFrame to hold the user behavior features computed below.
df_new = pd.DataFrame()
Number of browses
from collections import Counter
df_new['pv_much'] = df_Be.apply(lambda x: Counter(x)['1'])
df_new
Number of cart additions
# Whether the user added to cart
df_new['is_cart'] = df_Be.apply(lambda x: 1 if '2' in x else 0)
df_new
# How many times the user added to cart
df_new['cart_much'] = df_Be.apply(lambda x: 0 if '2' not in x else Counter(x)['2'])
df_new
Number of favorites
# Whether the user favorited
df_new['is_fav'] = df_Be.apply(lambda x: 1 if '3' in x else 0)
df_new
# How many times the user favorited
df_new['fav_much'] = df_Be.apply(lambda x: 0 if '3' not in x else Counter(x)['3'])
df_new
Correlation analysis
# Subset of the data
df_new.corr('spearman')
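The add-to-cart flag and the cart count are correlated to some degree, as are the favorite flag and the favorite count. However, experiments showed that keeping only one variable of each pair gives essentially the same accuracy as keeping all of them, so all variables are used for modeling below.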
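Spearman correlation is the Pearson correlation of the ranks, which is why a 0/1 flag and its matching count tend to correlate strongly. The following standard-library sketch shows the computation on toy data; the helper names are illustrative, and tied values receive the average of their ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a 0/1 flag and a count, in the spirit of is_cart and cart_much.
is_cart = [0, 0, 1, 1, 1]
cart_much = [0, 0, 1, 3, 2]
rho = spearman(is_cart, cart_much)
print(round(rho, 3))
```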
Data labels
import seaborn as sns
# Whether the user purchased
df_new['is_buy'] = df_Be.apply(lambda x: 1 if '4' in x else 0)
df_new
df_new.is_buy.value_counts()
1 6689
0 3050
Name: is_buy, dtype: int64
df_new['label'] = df_new['is_buy']
del df_new['is_buy']
df_new.head()
f, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.set_palette(["#9b59b6", "#3498db"])  # set the palette used by all plots
sns.distplot(df_new['fav_much'], bins=30, kde=True, label='123', ax=ax[0]);
sns.distplot(df_new['cart_much'], bins=30, kde=True, label='12', ax=ax[1]);
C:\Users\CDA\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning:
`distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Building the models
Splitting the dataset
from sklearn.model_selection import train_test_split
X = df_new.iloc[:, :-1]
Y = df_new.iloc[:, -1]
X.head()
Y.head()
U_Id
1      0
100    1
115    0
117    1
118    0
Name: label, dtype: int64
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)
Logistic regression
Model building
from sklearn.linear_model import LogisticRegression
LR_1 = LogisticRegression().fit(Xtrain, Ytrain)
# Quick test
LR_1.score(Xtest, Ytest)
0.6741957563312799
Model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_curve
# Confusion matrix
print(metrics.confusion_matrix(Ytest, LR_1.predict(Xtest)))
[[   0  952]
 [   0 1970]]
print(classification_report(Ytest, LR_1.predict(Xtest)))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
fpr,tpr,threshold = roc_curve(Ytest,LR_1.predict_proba(Xtest)[:,1])
roc_auc = auc(fpr,tpr)
print(roc_auc)
0.6379193682549162
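The confusion matrix above says more than the accuracy score does. The sketch below derives the usual metrics from its four cells by hand. The model predicts the positive class for every test sample, so its accuracy equals the share of buyers in the test set, and recall for the non-buyer class is zero. This is also why scikit-learn warns that precision is ill-defined for a label with no predicted samples.

```python
# Cells copied from the printed confusion matrix (rows: actual, columns: predicted).
tn, fp = 0, 952
fn, tp = 0, 1970

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
majority_share = (fn + tp) / total  # share of actual positives in the test set

recall_pos = tp / (tp + fn)
precision_pos = tp / (tp + fp)
recall_neg = tn / (tn + fp)
# Precision for class 0 would be tn / (tn + fn), which is 0 / 0 here and so undefined.
predicted_neg = tn + fn

print(f"accuracy={accuracy:.4f}, majority share={majority_share:.4f}")
print(f"recall(1)={recall_pos:.2f}, precision(1)={precision_pos:.4f}, recall(0)={recall_neg:.2f}")
```

In other words, the 0.674 score is what a constant "always buys" rule would achieve on this split, so accuracy alone does not show that the model has learned anything.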
Random forest
Model building
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200, max_depth=1)
rfc.fit(Xtrain, Ytrain)
RandomForestClassifier(max_depth=1, n_estimators=200)
Model evaluation
# Confusion matrix
print(metrics.confusion_matrix(Ytest, rfc.predict(Xtest)))
[[   0  952]
 [   0 1970]]
# Classification report
print(metrics.classification_report(Ytest, rfc.predict(Xtest)))
D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
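Both models end up predicting "buy" for every test user. One commonly tried adjustment is to reweight the classes. The sketch below shows this on synthetic data with scikit-learn's `class_weight="balanced"` option and a stratified split. The data, the single cart-count feature and the Poisson rates are all made up, so the numbers say nothing about the article's dataset; the point is only that the reweighted model no longer collapses to a single class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Synthetic labels with roughly the same imbalance as the article (about 69% buyers).
y = (rng.random(n) < 0.69).astype(int)

# Hypothetical feature: buyers tend to have more cart additions than non-buyers.
cart_much = rng.poisson(lam=np.where(y == 1, 3.0, 1.0))
X = cart_much.reshape(-1, 1)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# class_weight="balanced" weights each class inversely to its frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```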