
Practical guide | Hands-on analysis case study: user behavior prediction

2025-11-26 12:18

(This needs to be set to a multiple of the bus width, otherwise it will slow things down.)

data.head()


data

Dask DataFrame Structure :


Dask Name: read-csv, 58 tasks

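Unlike pandas, what we get here is only the structure of the DataFrame, not the DataFrame itself. Dask splits the DataFrame into chunks when loading it; these chunks live on disk rather than in RAM. To output the DataFrame, all the chunks first have to be read into RAM and stitched together, and only then is the final DataFrame displayed. Calling .compute() forces Dask to do this; without it, nothing is computed. In other words, Dask loads data lazily, much like Python's iterators: the data is actually loaded only when it is needed.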
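The iterator analogy can be made concrete with a small sketch that does not use Dask at all. The function `read_in_chunks` and the in-memory `source` are illustrative assumptions, standing in for a chunked reader and a large CSV file on disk.

```python
import io
from itertools import islice

def read_in_chunks(lines, chunk_size):
    """Yield lists of up to `chunk_size` items; nothing is read until iterated."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical stand-in for a large CSV file on disk.
source = io.StringIO("1,pv\n2,buy\n3,pv\n4,cart\n5,pv\n")

chunks = read_in_chunks(source, chunk_size=2)  # builds the plan only, reads nothing
position_before = source.tell()                # still at the start of the "file"

# Consuming the generator plays the role of .compute():
# only now are the chunks read and stitched together.
rows = [line.strip() for chunk in chunks for line in chunk]
print(position_before, rows)
```

Creating the generator costs nothing; the work happens when it is consumed, which mirrors the difference between `data` and `data.compute()`.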

# Actually load the data
data.compute()


# Visualize the task graph: 58 partition tasks
data.visualize()

Data preprocessing

Data compression

# Check the current data types
data.dtypes

U_Id int64

T_Id int64

C_Id int64

Be_type object

Ts int64

dtype: object

# Compress to 32-bit uint (unsigned integer), since the transaction data has no negative numbers
dtypes = {

'U_Id': 'uint32',

'T_Id': 'uint32',

'C_Id': 'uint32',

'Be_type': 'object',

'Ts': 'int64'

}

data = data.astype(dtypes)

data.dtypes

U_Id uint32

T_Id uint32

C_Id uint32

Be_type object

Ts int64

dtype: object
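Before narrowing a column from int64 to uint32 it is worth checking that its values fit. The following is a minimal sketch in plain Python; the helper `fits_uint32` is a hypothetical name, and the sample IDs are the U_Id values that appear in the outputs of this article.

```python
import struct

UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold

def fits_uint32(values):
    """Return True if every value is non-negative and fits in 32 bits."""
    return all(0 <= v <= UINT32_MAX for v in values)

# Bytes per value in the fixed-size little-endian encodings.
bytes_int64 = struct.calcsize("<q")   # signed 64-bit
bytes_uint32 = struct.calcsize("<I")  # unsigned 32-bit

sample_ids = [1, 100, 115, 117, 118]
print(fits_uint32(sample_ids), bytes_int64, bytes_uint32)
```

Each ID column then needs half the bytes per value, provided no value is negative or larger than `UINT32_MAX`.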

Missing values

# Data read through the dask interface cannot be checked for missing values directly with common pandas functions such as .isnull()
data.isnull()

Dask DataFrame Structure :


columns1 = ['U_Id', 'T_Id', 'C_Id', 'Be_type', 'Ts']

tmpDf1 = pd.DataFrame(columns=columns1)

tmpDf1


s = data["U_Id"].isna()
s.loc[s == True]

Dask Series Structure:
npartitions=58
    bool
     ...
     ...
     ...
Name: U_Id, dtype: bool
Dask Name: loc-series, 348 tasks

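Column U_Id: total number of missing values is 0
Column T_Id: total number of missing values is 0
Column C_Id: total number of missing values is 0
Column Be_type: total number of missing values is 0
Column Ts: total number of missing values is 0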


No missing values.
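The code that produced the per-column totals is not shown in the article. The sketch below illustrates the underlying pattern in plain Python: count per chunk, then add the partial counts together. The function `count_missing` and the sample rows are hypothetical.

```python
from collections import Counter

COLUMNS = ["U_Id", "T_Id", "C_Id", "Be_type", "Ts"]

def count_missing(chunks, columns=COLUMNS):
    """Sum per-column missing-value counts over an iterable of row chunks."""
    totals = Counter({name: 0 for name in columns})
    for chunk in chunks:                      # one partial result per chunk ...
        for row in chunk:
            for name, value in zip(columns, row):
                if value is None or value == "":
                    totals[name] += 1         # ... folded into the running total
    return dict(totals)

# Two hypothetical chunks standing in for Dask partitions.
chunks = [
    [(1, 10, 20, "pv", 1511539200)],
    [(1, 11, 21, "pv", 1511539260), (100, None, 21, "", 1511539320)],
]
for name, total in count_missing(chunks).items():
    print(f"Column {name}: total number of missing values is {total}")
```

On the Dask DataFrame itself, the lazy result of `data.isnull()` likewise has to be reduced and computed before any numbers appear, for example with `data.isnull().sum().compute()`.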

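Data exploration and visualization

Here we use the pyecharts library. pyecharts is a data visualization tool that combines Python with ECharts, the open-source charting library originally developed by Baidu. The coding conventions of the new 1.X releases differ substantially from those of the old 0.5.X releases; for the new version, see the official documentation (#/README).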

# pip install pyecharts -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pyecharts in d:\anaconda\lib\site-packages (0.1.9.4)
Requirement already satisfied: jinja2 in d:\anaconda\lib\site-packages (from pyecharts) (3.0.2)
Requirement already satisfied: future in d:\anaconda\lib\site-packages (from pyecharts) (0.18.2)
Requirement already satisfied: pillow in d:\anaconda\lib\site-packages (from pyecharts) (8.3.2)
Requirement already satisfied: MarkupSafe>=2.0 in d:\anaconda\lib\site-packages (from jinja2->pyecharts) (2.0.1)
Note: you may need to restart the kernel to use updated packages.


WARNING: Ignoring invalid distribution -umpy (d:\anaconda\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (d:\anaconda\lib\site-packages)


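Pie chart

# For example, we want to draw a nice pie chart showing the share of each type of user behavior
data["Be_type"]

# When using dask, every supported pandas function needs .compute() appended before it is actually executed
Be_counts = data["Be_type"].value_counts().compute()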

Be_counts

pv 89716264

cart 5530446

fav 2888258

buy 2015839

Name: Be_type, dtype: int64

Be_index = Be_counts.index.tolist()  # extract the labels

Be_index

['pv', 'cart', 'fav', 'buy']

Be_values = Be_counts.values.tolist()  # extract the values

Be_values

[89716264, 5530446, 2888258, 2015839]
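Before plotting, the shares can be worked out directly from these counts. This is a small sketch using only the four numbers printed above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
be_counts = {"pv": 89716264, "cart": 5530446, "fav": 2888258, "buy": 2015839}

total = sum(be_counts.values())

# Share of each behavior type in all recorded events, as a percentage.
shares = {name: round(100 * count / total, 2) for name, count in be_counts.items()}

# Events per 100 page views: a rough view of how activity narrows from pv to buy.
per_100_pv = {name: round(100 * count / be_counts["pv"], 2)
              for name, count in be_counts.items()}

print(total)
print(shares)
print(per_100_pv)
```

Page views account for nearly nine in ten events, so the pie chart is dominated by one slice; the per-100-page-views figures are closer to what a funnel chart conveys.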

from pyecharts import options as opts
from pyecharts.charts import Pie

# Pie expects its data as a list of (name, value) pairs
c = Pie()
c.add("", [list(z) for z in zip(Be_index, Be_values)])  # zip pairs up the items of its iterables as tuples
c.set_global_opts(title_opts=opts.TitleOpts(title="User behavior"))  # global options (chart title)
c.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
c.render_notebook()  # render in the current notebook environment
# c.render("pie_base.html")  # optionally write the chart to a local file

<pyecharts.charts.basic_charts.pie.Pie at 0x1b2da75ae48>

Funnel chart

from pyecharts.charts import Funnel  # older pyecharts versions did not need .charts in the import
import pyecharts.options as opts
from IPython.display import Image as IMG
from pyecharts import options as opts
from pyecharts.charts import Pie

<pyecharts.charts.basic_charts.funnel.Funnel at 0x1b2939d50c8>

Data analysis

Timestamp conversion

Dask's support for timestamps is rather unfriendly.

type(data)

dask.dataframe.core.DataFrame

import time

data['Ts1'] = data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x)))
data['Ts2'] = data['Ts'].apply(lambda x: time.strftime("%Y-%m-%d", time.localtime(x)))
data['Ts3'] = data['Ts'].apply(lambda x: time.strftime("%H:%M:%S", time.localtime(x)))

D:\anaconda\lib\site-packages\dask\dataframe\core.py:3701: UserWarning:
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('Ts', 'object'))
  warnings.warn(meta_warning(meta))

data.head(1)


data.dtypes

U_Id uint32

T_Id uint32

C_Id uint32

Be_type object

Ts int64

Ts1 object

Ts2 object

Ts3 object

dtype: object
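Note that `time.localtime` uses the time zone of the machine running the code, so the same timestamp can produce different strings on different machines. The sketch below pins the time zone explicitly; treating the data as China Standard Time (UTC+8) is an assumption, and `split_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps should be read as China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

def split_timestamp(ts, tz=CST):
    """Convert a Unix timestamp into (date-time, date, time) strings."""
    moment = datetime.fromtimestamp(ts, tz=tz)
    return (
        moment.strftime("%Y-%m-%d %H:%M:%S"),
        moment.strftime("%Y-%m-%d"),
        moment.strftime("%H:%M:%S"),
    )

print(split_timestamp(1511539200))
# ('2017-11-25 00:00:00', '2017-11-25', '00:00:00')
```

The three strings correspond to the Ts1, Ts2 and Ts3 columns. In the Dask version, passing `meta=('Ts', 'object')` to `apply`, as the warning above suggests, spares Dask from guessing the output type.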

Take a subset of the data to debug the code

df = data.head(1000000)

df.head(1)


Analysis of user traffic and purchase times

User behavior statistics table

describe = df.loc[:, ["U_Id", "Be_type"]]
# One row per user ID; the zero column is a placeholder that iloc drops at the end
user_ids = df["U_Id"].unique()
ids = pd.DataFrame(np.zeros(len(user_ids)), index=user_ids)
# Count each behavior type per user
pv_class = describe[describe["Be_type"] == "pv"].groupby("U_Id").count()
pv_class.columns = ["pv"]
buy_class = describe[describe["Be_type"] == "buy"].groupby("U_Id").count()
buy_class.columns = ["buy"]
fav_class = describe[describe["Be_type"] == "fav"].groupby("U_Id").count()
fav_class.columns = ["fav"]
cart_class = describe[describe["Be_type"] == "cart"].groupby("U_Id").count()
cart_class.columns = ["cart"]
# Join the four count columns onto the user index and drop the placeholder column
user_behavior_counts = ids.join(pv_class).join(fav_class).join(cart_class).join(buy_class).iloc[:, 1:]
user_behavior_counts.head()
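The same table can be sketched in plain Python, which makes the logic of the filter, group and join steps easier to follow. The function `behavior_counts` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

BEHAVIORS = ["pv", "fav", "cart", "buy"]  # column order of the joined table

def behavior_counts(events):
    """Count each behavior type per user from (user_id, behavior) pairs."""
    per_user = defaultdict(Counter)
    for user_id, behavior in events:
        per_user[user_id][behavior] += 1
    # A Counter returns 0 for behaviors a user never performed.
    return {uid: [counts[b] for b in BEHAVIORS] for uid, counts in per_user.items()}

# Hypothetical events, for illustration only.
events = [(1, "pv"), (1, "pv"), (1, "buy"), (100, "pv"), (100, "fav"), (100, "pv")]
table = behavior_counts(events)
print(table)
```

One difference is worth knowing: the pandas joins leave NaN where a user has no events of a given type, whereas this sketch yields 0.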


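Total traffic and transaction volume over time (daily)

import matplotlib.pyplot as plt
from matplotlib import font_manager

# Fix garbled minus signs on the axis ticks
# (the minus sign '-' would otherwise be drawn as a square)
plt.rcParams['axes.unicode_minus'] = False
# Fix garbled Chinese characters
plt.rcParams['font.sans-serif'] = ['Simhei']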

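The daily trends in total traffic and transaction volume show only small fluctuations from 25 November to 1 December 2017. On 2 December 2017 both traffic and transaction volume rose sharply, and both stayed high on 2 and 3 December. One reason is that 2 and 3 December fell on a weekend; there may also have been promotions on those two days, which should be checked against the actual business situation. (In the chart, traffic rose on Friday while transaction volume fell; a possible explanation is that the weekend campaign led users to postpone their Friday purchases.)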

Total traffic and transaction volume over time (hourly)

# Prepare the data
df_pv_timestamp = df[df["Be_type"] == "pv"][["Be_type", "Ts1"]]
df_pv_timestamp["Ts1"] = pd.to_datetime(df_pv_timestamp["Ts1"])
df_pv_timestamp = df_pv_timestamp.set_index("Ts1")
df_pv_timestamp = df_pv_timestamp.resample("H").count()["Be_type"]
df_pv_timestamp

df_buy_timestamp = df[df["Be_type"] == "buy"][["Be_type", "Ts1"]]
df_buy_timestamp["Ts1"] = pd.to_datetime(df_buy_timestamp["Ts1"])
df_buy_timestamp = df_buy_timestamp.set_index("Ts1")
df_buy_timestamp = df_buy_timestamp.resample("H").count()["Be_type"]
df_buy_timestamp

Ts1
2017-09-11 16:00:00        1
2017-09-11 17:00:00        0
2017-09-11 18:00:00        0
2017-09-11 19:00:00        0
2017-09-11 20:00:00        0
                       ...
2017-12-03 20:00:00     8587
2017-12-03 21:00:00    10413
2017-12-03 22:00:00     9862
2017-12-03 23:00:00     7226
2017-12-04 00:00:00        1
Freq: H, Name: Be_type, Length: 2001, dtype: int64

Ts1
2017-11-25 00:00:00     64
2017-11-25 01:00:00     29
2017-11-25 02:00:00     18
2017-11-25 03:00:00      8
2017-11-25 04:00:00      3
                      ...
2017-12-03 19:00:00    141
2017-12-03 20:00:00    159
2017-12-03 21:00:00    154
2017-12-03 22:00:00    154
2017-12-03 23:00:00    123
Freq: H, Name: Be_type, Length: 216, dtype: int64
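The two series have very different lengths because `resample("H")` creates one bin for every hour between the first and the last timestamp, including hours with no events. The arithmetic can be checked with a short sketch; `hourly_bins` is a hypothetical helper, and the endpoints are read off the two outputs above.

```python
from datetime import datetime, timedelta

def hourly_bins(first, last):
    """Number of hourly bins from `first` to `last`, both inclusive."""
    return int((last - first) / timedelta(hours=1)) + 1

pv_bins = hourly_bins(datetime(2017, 9, 11, 16), datetime(2017, 12, 4, 0))
buy_bins = hourly_bins(datetime(2017, 11, 25, 0), datetime(2017, 12, 3, 23))
print(pv_bins, buy_bins)  # 2001 216
```

A single page-view record dated 11 September 2017 stretches the first series to 2001 bins, most of which are zero. Restricting the data to 25 November through 3 December before resampling would likely give a cleaner hourly chart.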

# Plot
plt.figure(figsize=(20, 6), dpi=70)
x2 = df_buy_timestamp.index
plt.plot(range(len(x2)), df_buy_timestamp.values, label="Transaction volume", color="blue", linewidth=2)
plt.title("Total transaction volume over time, line chart (hourly)")
x2 = [i.strftime("%Y-%m-%d %H:%M") for i in x2]
plt.xticks(range(len(x2))[::4], x2[::4], rotation=90)
plt.xlabel("Ts2")
plt.ylabel("Ts3")
plt.grid(alpha=0.4);

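Feature engineering

Approach: ignore the time window and predict whether a user purchases using only behaviors such as page views and favorites.
Steps: group by user ID (U_Id) and count each user's page views, favorites and add-to-cart actions, giving:

whether the user viewed, and the number of views; whether the user favorited, and the number of favorites; whether the user added to cart, and the number of add-to-cart actions

These features are then used to predict whether the user ultimately purchases.

# Drop the timestamp columns
df = df[["U_Id", "T_Id", "C_Id", "Be_type"]]
df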


Behavior types

U_Id
1      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
100    [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 3, ...
115    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
117    [4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, ...
118    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Name: Be_type1, dtype: object
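The article does not show the code that builds this per-user series (called `df_Be` in the code that follows). The sketch below shows one way such a structure could be produced in plain Python. The mapping from behavior names to the string codes '1' to '4' is an assumption inferred from the feature code below, and `group_codes_by_user` is a hypothetical name.

```python
from collections import defaultdict

# Assumed encoding, inferred from the feature code that follows:
# '1' page view, '2' add to cart, '3' favorite, '4' purchase.
BEHAVIOR_CODES = {"pv": "1", "cart": "2", "fav": "3", "buy": "4"}

def group_codes_by_user(events):
    """Collect each user's behavior codes, in event order, keyed by user ID."""
    grouped = defaultdict(list)
    for user_id, behavior in events:
        grouped[user_id].append(BEHAVIOR_CODES[behavior])
    return dict(grouped)

# Hypothetical events, for illustration only.
events = [(1, "pv"), (1, "pv"), (117, "buy"), (117, "pv"), (100, "fav")]
codes_by_user = group_codes_by_user(events)
print(codes_by_user)
```

Each user ends up with one list of codes, which is the shape the feature functions below expect.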

Finally, create a DataFrame to hold the user behavior features computed below.

from collections import Counter

df_new = pd.DataFrame()

Page view count

df_new['pv_much'] = df_Be.apply(lambda x: Counter(x)['1'])

df_new


Add-to-cart count

# Whether the user added to cart
df_new['is_cart'] = df_Be.apply(lambda x: 1 if '2' in x else 0)

df_new


# How many times the user added to cart
df_new['cart_much'] = df_Be.apply(lambda x: 0 if '2' not in x else Counter(x)['2'])

df_new


Favorite count

# Whether the user favorited
df_new['is_fav'] = df_Be.apply(lambda x: 1 if '3' in x else 0)

df_new


# How many times the user favorited
df_new['fav_much'] = df_Be.apply(lambda x: 0 if '3' not in x else Counter(x)['3'])

df_new
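The five feature columns can also be derived in a single pass over each user's code list. This is a sketch under the assumption that the codes are the strings '1' to '4'; `behavior_features` is a hypothetical helper.

```python
from collections import Counter

def behavior_features(codes):
    """Build the five features from one user's list of behavior codes."""
    counts = Counter(codes)  # missing codes count as 0, so no membership test is needed
    return {
        "pv_much": counts["1"],
        "is_cart": int(counts["2"] > 0),
        "cart_much": counts["2"],
        "is_fav": int(counts["3"] > 0),
        "fav_much": counts["3"],
    }

features = behavior_features(["1", "1", "3", "1", "3", "4"])
print(features)
```

Because a `Counter` returns 0 for keys it has never seen, the `0 if '2' not in x else ...` guard in the code above is not strictly required.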


Correlation analysis

# Subset of the data
df_new.corr('spearman')


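Whether a user added to cart correlates with the add-to-cart count, and whether a user favorited correlates with the favorite count. However, testing showed that including only one variable from each pair gives essentially the same result as including all of them, so all variables are used for modeling below.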
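Spearman correlation is the Pearson correlation of the rank-transformed data, which is why a 0/1 flag and its underlying count tend to correlate strongly. The sketch below implements it in plain Python with average ranks for ties; the function names and the sample values are illustrative only.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-user values: whether the user added to cart, and how often.
is_cart = [0, 1, 1, 0, 1]
cart_much = [0, 3, 1, 0, 7]
rho = spearman(is_cart, cart_much)
print(round(rho, 3))
```

The flag is a coarsened version of the count, so the two share most of their rank information. That fits the observation that using one variable from each pair performs about the same as using both.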

Data labels

import seaborn as sns

# Whether the user purchased
df_new['is_buy'] = df_Be.apply(lambda x: 1 if '4' in x else 0)

df_new


df_new.is_buy.value_counts()

1 6689

0 3050

Name: is_buy, dtype: int64

df_new['label'] = df_new['is_buy']
del df_new['is_buy']
df_new.head()


f, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.set_palette(["#9b59b6", "#3498db"])  # set the colors used for all plots (two explicit hex colors)
sns.distplot(df_new['fav_much'], bins=30, kde=True, label='123', ax=ax[0]);
sns.distplot(df_new['cart_much'], bins=30, kde=True, label='12', ax=ax[1]);

C:\Users\CDA\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning:
`distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)


Building the models

Splitting the dataset

from sklearn.model_selection import train_test_split

X = df_new.iloc[:, :-1]
Y = df_new.iloc[:, -1]
X.head()
Y.head()


U_Id
1      0
100    1
115    0
117    1
118    0
Name: label, dtype: int64

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)
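The size of the resulting split can be checked by hand from the label counts reported earlier. This sketch assumes that a fractional test size is rounded up to a whole number of rows, which is consistent with the 2,922 test rows in the confusion matrices of this article.

```python
import math

n_users = 6689 + 3050          # label counts reported by value_counts() above
test_size = 0.3

# Assumption: a fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_users)
n_train = n_users - n_test

print(n_users, n_train, n_test)  # 9739 6817 2922
```

Buyers make up 6689 of the 9739 users (about 69%), so a model that always predicts "buy" already scores about 0.69 on the full data; any classifier should be judged against that baseline.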

Logistic regression

Model building

from sklearn.linear_model import LogisticRegression

LR_1 = LogisticRegression().fit(Xtrain, Ytrain)

# Quick test
LR_1.score(Xtest, Ytest)

0.6741957563312799

Model evaluation

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_curve

# Confusion matrix
print(metrics.confusion_matrix(Ytest, LR_1.predict(Xtest)))

[[   0  952]
 [   0 1970]]
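Reading this matrix by hand shows what the model is doing. The sketch below recomputes accuracy, precision and recall from the four numbers; `safe_divide` is a hypothetical helper that returns None where a ratio is undefined.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
matrix = [[0, 952],
          [0, 1970]]

def safe_divide(numerator, denominator):
    """Return the ratio, or None when it is undefined (zero denominator)."""
    return numerator / denominator if denominator else None

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Precision divides by the column total (predicted), recall by the row total (actual).
precision_0 = safe_divide(matrix[0][0], matrix[0][0] + matrix[1][0])
recall_0 = safe_divide(matrix[0][0], matrix[0][0] + matrix[0][1])
precision_1 = safe_divide(matrix[1][1], matrix[0][1] + matrix[1][1])
recall_1 = safe_divide(matrix[1][1], matrix[1][0] + matrix[1][1])

print(accuracy, precision_0, recall_0, precision_1, recall_1)
```

The first column is all zeros, so the model predicts "buy" for every test user. Its accuracy of 0.674 is simply the share of buyers in the test set, and the precision for class 0 has a zero denominator, which is what an UndefinedMetricWarning from `classification_report` reports.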

print(classification_report(Ytest,LR_1.predict(Xtest)))

D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))


fpr,tpr,threshold = roc_curve(Ytest,LR_1.predict_proba(Xtest)[:,1])

roc_auc = auc(fpr,tpr)

print(roc_auc)

0.6379193682549162
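The AUC can be read as the probability that a randomly chosen buyer receives a higher predicted score than a randomly chosen non-buyer. The sketch below computes it that way in plain Python; it is an illustration of the definition rather than how `roc_curve` and `auc` work internally, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of about 0.64 therefore suggests that the predicted probabilities carry some ranking information even though, at the default threshold, every test user is classified as a buyer.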

Random forest

Model building

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200, max_depth=1)

rfc.fit(Xtrain, Ytrain)

RandomForestClassifier(max_depth=1, n_estimators=200)

Model evaluation

# Confusion matrix
print(metrics.confusion_matrix(Ytest, rfc.predict(Xtest)))

[[   0  952]
 [   0 1970]]

# Classification report

print(metrics.classification_report(Ytest, rfc.predict(Xtest)))

D:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

