Using ClickHouse from Python: Practice and Pitfalls

2022-05-17 17:07:24  Source: Internet  Author: Anonymous


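ClickHouse is an open-source column-oriented database management system (DBMS) that has attracted a great deal of attention in recent years. It is used mainly for online analytical processing (OLAP) and was open-sourced in 2016. Its community in China is very active, and major companies have been adopting it at scale.

Toutiao uses ClickHouse internally for user behavior analysis: several thousand ClickHouse nodes in total, up to 1,200 nodes in a single cluster, tens of PB of data overall, and roughly 300 TB of new raw data per day.
Tencent uses ClickHouse for game data analysis and has built a complete monitoring and operations system around it.
Ctrip began trialling it in July 2018; 80% of its workloads now run on ClickHouse, with more than a billion new rows and nearly a million queries per day.
Kuaishou also uses ClickHouse, storing about 10 PB in total and adding 200 TB per day, with 90% of queries finishing in under 3 s.

Outside China, Yandex runs several hundred nodes for user click-behavior analysis, and leading companies such as Cloudflare and Spotify use it as well.

ClickHouse was originally developed for Yandex.Metrica, the world's second-largest web analytics platform, and has served as a core component of that system for many years.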

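1. ClickHouse in practice

First, a quick review of some basic concepts:

OLTP: traditional relational databases. The main operations are insert, delete, update and select, with an emphasis on transactional consistency, as in banking and e-commerce systems.
OLAP: data-warehouse databases. The main work is reading data for complex analysis, with a focus on decision support and on delivering simple, intuitive results.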

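1.1. ClickHouse for data warehouse scenarios

ClickHouse is a column-oriented database, and column-oriented databases are a better fit for OLAP scenarios. The key characteristics of OLAP scenarios are:

The vast majority of requests are reads.
Data is updated in fairly large batches (> 1000 rows) rather than row by row, or is not updated at all.
Data that has been added to the database is not modified.
Reads extract a large number of rows but only a small subset of the columns.
Tables are wide, that is, each table contains a large number of columns.
Queries are relatively infrequent (usually hundreds of queries per second per server, or fewer).
For simple queries, latencies of around 50 ms are acceptable.
Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).
Processing a single query requires high throughput (up to billions of rows per second per server).
Transactions are not required.
Data consistency requirements are low.
Each query involves one large table; all the other tables are small.
The query result is significantly smaller than the source data. In other words, the data is filtered or aggregated so that the result fits in the RAM of a single server.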
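To make the "many rows, few columns" point concrete, here is a toy sketch in plain Python. It is not how ClickHouse stores data internally; the sample rows and the helper name `to_columns` are invented for illustration. It only shows that a column layout lets an aggregate touch a single column instead of every field of every row.

```python
# Toy comparison of row-oriented and column-oriented layouts (illustrative only).
rows = [
    {"user_id": 1, "url": "/a", "duration": 12, "country": "CN"},
    {"user_id": 2, "url": "/b", "duration": 7, "country": "RU"},
    {"user_id": 3, "url": "/a", "duration": 30, "country": "CN"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited even though only one field is needed.
total_from_rows = sum(row["duration"] for row in rows)

# Column layout: only the "duration" column is scanned.
total_from_columns = sum(columns["duration"])
```

Both totals are the same; the difference is how much data has to be read to compute them, which matters when a table has dozens of columns and hundreds of millions of rows.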

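1.2. Client tool: DBeaver

The ClickHouse client tool used here is DBeaver; its website is https://dbeaver.io/.

DBeaver is a free, open-source (GPL) universal database tool for developers and database administrators. [Baidu Baike]
Ease of use is the project's main goal, and it is a carefully designed and developed database management tool. It is free, cross-platform, built on an open-source framework, and allows extensions (plugins) to be written for it.
It supports any database that has a JDBC driver.
It can handle any external data source.

Create and configure a new connection from the "Database" menu, as shown in the figure below, then select and download the ClickHouse driver (the driver is not bundled by default).

[Figure: selecting and downloading the ClickHouse driver in DBeaver; image not preserved]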

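DBeaver connects through JDBC. The default URL and port are typically:

jdbc:clickhouse://192.168.17.61:8123

When using DBeaver to connect to ClickHouse and run queries, the connection or the query sometimes times out. In that case, add the socket_timeout parameter to the connection URL:

jdbc:clickhouse://{host}:{port}[/{database}]?socket_timeout=600000

[Figure: DBeaver connection settings with socket_timeout; image not preserved]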
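The URL pattern above can also be assembled programmatically, for example when generating connection profiles. The following is a minimal sketch using only the standard library; the function name `clickhouse_jdbc_url` is hypothetical, and the database name in the usage line is taken from the code later in this article.

```python
from urllib.parse import urlencode


def clickhouse_jdbc_url(host, port=8123, database=None, **params):
    """Build jdbc:clickhouse://{host}:{port}[/{database}]?key=value."""
    url = f"jdbc:clickhouse://{host}:{port}"
    if database:
        url += f"/{database}"
    if params:
        # Extra keyword arguments become query-string parameters.
        url += "?" + urlencode(params)
    return url


url = clickhouse_jdbc_url("192.168.17.61", database="ebd_all_b04",
                          socket_timeout=600000)
```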

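1.3. Big data application in practice

Environment summary:
Hardware resources are limited: only 16 GB of memory, while the transaction data runs to hundreds of millions of rows.

The application works on transaction data: a main transaction table plus related customer information, material information, historical prices, and discount and points information. The main transaction table is a self-referencing, tree-structured table.

To analyze customer transaction behavior with these limited resources, transaction details are extracted and aggregated into transaction records by day and by transaction point, as shown in the figure below.

[Figure: aggregating transaction details into daily transaction records; image not preserved]

In ClickHouse, the transaction data structure consists of 60 columns (fields). Part of it is shown below:

[Figure: excerpt of the transaction table structure; image not preserved]

Out-of-memory errors such as "would use 10.20 GiB, maximum: 9.31 GiB" occurred frequently. To work around them, an SQL statement was written in ClickHouse SQL to extract an aggregated dataset, as shown below.

[Figure: SQL statement for extracting the aggregated dataset; image not preserved]

The result comes back in about 60 s, as shown below:

[Figure: query result; image not preserved]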
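The author's SQL is not preserved, so it cannot be reproduced here. As a separate, complementary technique (not the author's method), ClickHouse has per-query settings that cap memory and let GROUP BY spill to disk: `max_memory_usage` and `max_bytes_before_external_group_by`. The sketch below only builds such a settings dictionary; the helper name `low_memory_settings`, the 8 GiB figure and the "half of the cap" ratio are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def low_memory_settings(max_memory_gib):
    """Return per-query settings that cap memory and allow GROUP BY to spill to disk."""
    max_memory = int(max_memory_gib * GIB)
    return {
        "max_memory_usage": max_memory,
        # Assumed ratio: start spilling GROUP BY state at half of the cap.
        "max_bytes_before_external_group_by": max_memory // 2,
    }


settings = low_memory_settings(8)

# With a connected clickhouse_driver Client (not created here), the dictionary
# can be passed per query:
# df = client.query_dataframe(query_sql, settings=settings)
```

Spilling to disk trades speed for the ability to finish at all, so it is best combined with aggregating as much as possible inside the query itself.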

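2. Using ClickHouse from Python

2.1. Third-party Python driver for ClickHouse: clickhouse_driver

At the time of writing, ClickHouse did not provide an official Python driver. The commonly used third-party driver is clickhouse_driver, which can be installed with pip: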

pip install clickhouse_driver
Collecting clickhouse_driver
  Downloading https://files.pythonhosted.org/packages/88/59/c570218bfca84bd0ece896c0f9ac0bf1e11543f3c01d8409f5e4f801f992/clickhouse_driver-0.2.1-cp36-cp36m-win_amd64.whl (173kB)
    100% |████████████████████████████████| 174kB 27kB/s
Collecting tzlocal<3.0 (from clickhouse_driver)
  Downloading https://files.pythonhosted.org/packages/5d/94/d47b0fd5988e6b7059de05720a646a2930920fff247a826f61674d436ba4/tzlocal-2.1-py2.py3-none-any.whl
Requirement already satisfied: pytz in d:\python\python36\lib\site-packages (from clickhouse_driver) (2020.4)
Installing collected packages: tzlocal, clickhouse-driver
Successfully installed clickhouse-driver-0.2.1 tzlocal-2.1

The Client API then failed to work, reporting the following error:

File "clickhouse_driver\varint.pyx", line 62, in clickhouse_driver.varint.read_varint
File "clickhouse_driver\bufferedreader.pyx", line 55, in clickhouse_driver.bufferedreader.BufferedReader.read_one
File "clickhouse_driver\bufferedreader.pyx", line 240, in clickhouse_driver.bufferedreader.BufferedSocketReader.read_into_buffer
EOFError: Unexpected EOF while reading bytes

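The Python driver must use ClickHouse port 9000; pointing it at port 8123 typically produces the error above.

A ClickHouse server and its clients can communicate over two protocols: HTTP (port 8123) and native (port 9000). DBeaver is configured through the JDBC driver, which uses port 8123.

By default clickhouse_driver returns query results as a list of tuples. It can also return a pandas DataFrame, which is what the code in this article uses: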

collection = self.client.query_dataframe(self.query_sql)
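The port mix-up is easy to guard against. The sketch below wraps client creation in a small helper that rejects the HTTP port before any connection is attempted. The helper name `make_client` and the two constants are hypothetical; `Client`, `execute` and `query_dataframe` are the clickhouse_driver calls used in this article. The guard is deliberately simplistic: it assumes the server uses the default ports.

```python
NATIVE_PORT = 9000  # native TCP protocol, used by clickhouse_driver
HTTP_PORT = 8123    # HTTP protocol, used by JDBC tools such as DBeaver


def make_client(host, database, port=NATIVE_PORT):
    """Create a clickhouse_driver Client, refusing the default HTTP port."""
    if port == HTTP_PORT:
        raise ValueError(
            "clickhouse_driver speaks the native protocol; "
            "use port 9000 instead of 8123"
        )
    # Imported lazily so the port check works even without the package installed.
    from clickhouse_driver import Client
    return Client(host=host, port=port, database=database)


# Usage, assuming a reachable server (not executed here):
# client = make_client("192.168.17.61", "ebd_all_b04")
# rows = client.execute("select 1")        # list of tuples
# df = client.query_dataframe("select 1")  # pandas DataFrame
```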

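2.2. Program code

My machine originally had only 8 GB of memory (since expanded to 16 GB). For that reason, and to keep things practical, the data is fetched in batches and saved to multiple files of about 1 GB each.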

# -*- coding: utf-8 -*-
'''
Created on 2021-03-01
@author: xiaoyw
'''
import pandas as pd
import json
import numpy as np
import datetime
from clickhouse_driver import Client
#from clickhouse_driver import connect

# Base data-access class for the ClickHouse database
class DbObj(object):
    '''
    192.168.17.61:9000
    ebd_all_b04.card_tbl_trade_m_orc
    '''
    def __init__(self, db_name):
        self.db_name = db_name
        host = '192.168.17.61'  # server address
        port = 9000  # native protocol port (8123 is the HTTP port)
        user = '***'  # user name (masked; not passed to Client below)
        password = '***'  # password (masked; not passed to Client below)
        database = db_name  # database
        send_receive_timeout = 25  # timeout in seconds
        self.client = Client(host=host, port=port, database=database)  #, send_receive_timeout=send_receive_timeout)
        #self.conn = connect(host=host, port=port, database=database)  #, send_receive_timeout=send_receive_timeout)

    def setpricetable(self, df):
        self.pricetable = df

    def get_trade(self, df_trade, filename):
        print('trade join price!')
        # Left-join the price table onto the batch of trades by day
        df_trade = pd.merge(left=df_trade, right=self.pricetable[['occurday','dim_date','end_date','v_0','v_92','v_95','zde_0','zde_92',
                              'zde_95']], how="left", on=['occurday'])
        df_trade.to_csv(filename, mode='a', encoding='utf-8', index=False)

    def get_datas(self, query_sql):
        n = 0  # running offset over card-customer data
        k = 0  # number of rows in the current DataFrame
        batch = 100000  # batch size
        i = 0  # sequence number used in the output file name
        flag = True  # loop flag; cleared when a batch comes back empty
        filename = 'card_trade_all_{}.csv'
        while flag:
            # "LIMIT a, b" means offset a, row count b, so the second
            # value is the batch size
            self.query_sql = query_sql.format(n, batch)
            print('query started')
            collection = self.client.query_dataframe(self.query_sql)
            print('return query result')
            df_trade = collection  # already a pandas DataFrame

            i = i + 1
            k = len(df_trade)
            if k > 0:
                self.get_trade(df_trade, filename.format(i))

            n = n + batch
            if k == 0:
                flag = False
            print('completed ' + str(k) + ' trade details!')
            print('usercard count ' + str(n))

        return n

# Price-change dataset
class PriceTable(object):
    def __init__(self, cityname, startdate):
        self.cityname = cityname
        self.startdate = startdate
        self.filename = 'price20210531.csv'

    def get_price(self):
        df_price = pd.read_csv(self.filename)
        ......
            # DataFrame.append was removed in pandas 2.0; use pd.concat there
            self.price_table=self.price_table.append(data_dict, ignore_index=True)

        print('generate price table!')

class CardTradeDB(object):
    def __init__(self, db_obj):
        self.db_obj = db_obj

    def insertdatasbycsv(self, filename):
        # The file contains columns of mixed types
        df = pd.read_csv(filename, low_memory=False)

    # Fetch the transaction records
    def gettradedatasbyid(self, id_list=None):
        # The SQL string is long, so it is written as a triple-quoted string
        query_sql = '''select c.carduser_id,c.org_id,c.cardasn,c.occurday as
                ......
                limit {},{})
                group by c.carduser_id,c.org_id,c.cardasn,c.occurday
                order by c.carduser_id,c.occurday'''

        n = self.db_obj.get_datas(query_sql)

        return n

if __name__ == '__main__':
    ptable = PriceTable('湖北', '2015-12-01')  # Hubei province
    ptable.get_price()

    db_obj = DbObj('ebd_all_b04')
    db_obj.setpricetable(ptable.price_table)
    ctd = CardTradeDB(db_obj)
    total = ctd.gettradedatasbyid()

The resulting local files are:

[Figure: list of generated CSV files; image not preserved]
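The batching idea in the program can be isolated into a small generator. The sketch below is a simplified illustration, not the author's code: `iter_batches` and `fake_fetch` are hypothetical names, and an in-memory list stands in for the database so that the example runs without a server. In ClickHouse, `LIMIT offset, count` takes the number of rows as its second value, so the template is formatted with the offset and the batch size.

```python
def iter_batches(fetch, sql_template, batch_size):
    """Yield (offset, batch) pairs until a query returns no rows.

    `fetch` is any callable that takes a SQL string and returns a sized
    result, for example client.query_dataframe. `sql_template` must contain
    "limit {},{}" placeholders for the offset and the row count.
    """
    offset = 0
    while True:
        batch = fetch(sql_template.format(offset, batch_size))
        if len(batch) == 0:
            break
        yield offset, batch
        offset += batch_size


# Stand-in for a database: 25 fake rows, sliced by the offset and count
# parsed back out of the SQL text.
FAKE_ROWS = list(range(25))


def fake_fetch(sql):
    offset, count = (int(x) for x in sql.rsplit("limit ", 1)[1].split(","))
    return FAKE_ROWS[offset:offset + count]


batches = list(iter_batches(fake_fetch,
                            "select id from t order by id limit {},{}", 10))
```

With a real client, each yielded batch would be written to its own file, as the program above does. Paging this way is only stable when the query has a deterministic ORDER BY.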

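3. Summary

In OLAP scenarios ClickHouse queries are very fast, but they need a lot of memory. The third-party Python driver clickhouse-driver covers most data-processing needs, and it is most convenient when results come back as a pandas DataFrame.

Both ClickHouse and pandas aggregate very quickly, and ClickHouse has a rich set of aggregate functions (for example anyLast(x), used in this article, which returns the last value encountered). Anything that can be aggregated in SQL is best done inside ClickHouse, so that a smaller result set is handed to Python for machine learning.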
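As a rough illustration of what an aggregate like anyLast does conceptually, the plain-Python sketch below keeps the last value seen for each key. The function name `any_last_by_key` and the sample rows are invented for illustration. ClickHouse itself does not guarantee the order in which rows are processed, so this mirrors the idea rather than the exact behavior.

```python
def any_last_by_key(rows, key_index, value_index):
    """For each key, keep the last value encountered while scanning rows in order."""
    last = {}
    for row in rows:
        last[row[key_index]] = row[value_index]
    return last


# Hypothetical detail rows: (card id, day, value)
details = [
    ("card_1", "2021-03-01", 95),
    ("card_2", "2021-03-01", 92),
    ("card_1", "2021-03-02", 0),
    ("card_1", "2021-03-03", 92),
]

summary = any_last_by_key(details, key_index=0, value_index=2)
```

Four detail rows collapse to two summary entries. Doing this kind of reduction in the database means far less data has to cross the network and fit into Python's memory.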

Deleting specific data from ClickHouse

from clickhouse_driver import Client as click_client  # assumed import; the original omits it

def info_del2(i):
    # Connection values are placeholders; 9000 is the native protocol port
    client = click_client(host='host', port=9000, user='user', password='password',
                          database='database')
    sql_detail = 'alter table ss_goods_order_all delete where order_id=' + str(i) + ';'
    try:
        client.execute(sql_detail)
    except Exception as e:
        print(e, 'failed to delete product data')

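When deleting data, working with ClickHouse from Python differs somewhat from working with MySQL. The positional %s placeholders commonly used with MySQL drivers do not work here, so the function above builds the complete statement as a string and converts the argument with str. clickhouse_driver does accept named placeholders of the form %(name)s, with the values passed to execute as a dict, which avoids manual concatenation.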
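String concatenation is risky if the id ever comes from untrusted input. A minimal sketch of a safer variant follows, assuming order_id is an integer column (an assumption; the original does not state the column type). The helper name `build_delete_sql` is hypothetical. The commented lines show the same deletion with clickhouse_driver's named-parameter form.

```python
def build_delete_sql(order_id):
    """Return the ALTER TABLE ... DELETE statement for one integer order_id."""
    order_id = int(order_id)  # raises ValueError for non-numeric input
    return f"alter table ss_goods_order_all delete where order_id={order_id}"


sql_detail = build_delete_sql("1001")

# With a connected clickhouse_driver Client (not created here), either of:
# client.execute(sql_detail)
# client.execute(
#     "alter table ss_goods_order_all delete where order_id=%(id)s",
#     {"id": 1001},
# )
```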

The above is based on personal experience; I hope it serves as a useful reference.
