DataFrame Tips Summary

1. Random sampling from a DataFrame

#https://rfriend.tistory.com/602
# Draw 1000 random rows from df
df.sample(n=1000, replace=False)  # sampling without replacement
df.sample(n=1000, replace=True)   # sampling with replacement
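
The difference between the two modes is easiest to see on a tiny frame. The sketch below uses a hypothetical 5-row DataFrame and a fixed `random_state` (both are assumptions for illustration, not part of the original snippet).

```python
import pandas as pd

# Hypothetical 5-row frame, used only for illustration.
df = pd.DataFrame({"x": range(5)})

# Without replacement: n cannot exceed len(df); each row appears at most once.
without = df.sample(n=5, replace=False, random_state=0)

# With replacement: n may exceed len(df), so some rows must repeat.
with_repl = df.sample(n=8, replace=True, random_state=0)

print(without["x"].is_unique)    # every row drawn once
print(with_repl["x"].is_unique)  # 8 draws from 5 rows cannot all be distinct
```

Asking for more rows than the frame holds with `replace=False` raises a `ValueError`, which is usually the first sign that the wrong mode was chosen.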

2. Extracting only the data within a specific period

#https://happy-obok.tistory.com/m/5

import pandas as pd
from datetime import datetime

Data['timestamp'] = pd.to_datetime(Data['timestamp'])
# Keep rows whose timestamp lies between the start and finish dates (both bounds inclusive)
Data = Data[(Data.timestamp >= datetime(start_year, start_month, start_day)) & (Data.timestamp <= datetime(finish_year, finish_month, finish_day))]
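
The same range filter can also be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on a hypothetical four-row frame; the column values and the two bounds are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
Data = pd.DataFrame({
    "timestamp": ["2021-01-15", "2021-02-01", "2021-03-31", "2021-04-02"],
    "value": [1, 2, 3, 4],
})
Data["timestamp"] = pd.to_datetime(Data["timestamp"])

start = pd.Timestamp("2021-02-01")
finish = pd.Timestamp("2021-03-31")

# between() keeps rows where start <= timestamp <= finish.
filtered = Data[Data["timestamp"].between(start, finish)]
print(filtered)
```

Note that a bound such as `2021-03-31` means midnight at the start of that day, so timestamps later on the finish day fall outside the range with either spelling.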

3. Renumbering the index sequentially

Data.reset_index(drop=True, inplace=True)

4. reset_index while naming the index column

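# As written, the rename maps "index" to itself; change the second "index" to the column name you want
Data.reset_index().rename(columns={"index": "index"})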
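
A small sketch of the rename doing real work: the old index becomes a regular column with a chosen name. The name `row_id` and the sample frame are hypothetical, picked only for this example.

```python
import pandas as pd

# Hypothetical frame with a non-sequential index.
Data = pd.DataFrame({"value": [10, 20, 30]}, index=[5, 7, 9])

# reset_index() moves the old index into a column called "index";
# rename() then gives that column a more meaningful name.
renamed = Data.reset_index().rename(columns={"index": "row_id"})

# Equivalent spelling: name the index first, then reset it.
same = Data.rename_axis("row_id").reset_index()

print(renamed)
```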

5. Dropping specific columns

# Return a copy with the age column dropped (df itself is unchanged)
df.drop('age', axis=1)
# Apply the drop directly to df -> inplace=True
df.drop('age', axis=1, inplace=True)

# Drop several columns at once
df.drop(['age','height'], axis=1, inplace=True)

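6. Counting missing values

df.isnull().sum()  # number of missing values in every column
df['A'].isnull().sum()  # number of missing values in column A
df.notnull().sum()  # number of non-missing values in every column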

7. Removing duplicates based on a specific column

# Remove duplicate rows, comparing only the asin column
df.drop_duplicates(['asin'])
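
By default `drop_duplicates` keeps the first row of each duplicate group; the `keep` parameter changes that. A minimal sketch on a hypothetical three-row frame (the `price` column is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with a repeated asin.
df = pd.DataFrame({"asin": ["A", "A", "B"], "price": [1, 2, 3]})

first = df.drop_duplicates(["asin"])                      # keeps the first "A" row
last = df.drop_duplicates(subset=["asin"], keep="last")   # keeps the last "A" row

print(first)
print(last)
```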

8. Removing a specific character from a column

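# regex=False treats '$' as a literal character regardless of the pandas version's default
products_copy['price'] = products_copy['price'].str.replace('$', '', regex=False)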
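
Once the currency symbol is gone, the column is still text. The sketch below, on a hypothetical price Series, shows one way to finish the conversion with `pd.to_numeric`; the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical price strings, including a missing entry.
price = pd.Series(["$12.99", "$5.00", None])

# Strip the literal dollar sign, then convert the remaining text to floats.
cleaned = price.str.replace("$", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```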

9. Stripping whitespace

# Strip whitespace from both ends
products["description"].str.strip()

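10. Removing newline characters

# Remove line breaks from feature
# Note: the regex r'\\n' matches a literal backslash followed by n; use r'\n' to match real newline characters
products_copy["feature"]= products_copy["feature"].replace(r'\\n','', regex=True)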
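
Whether the data holds real newline characters or the two-character text `\n` (common after loading escaped JSON) decides which pattern is needed. A minimal sketch on a hypothetical Series containing one of each:

```python
import pandas as pd

# Hypothetical values: the first has a real newline, the second a literal backslash + n.
feature = pd.Series(["line one\nline two", "literal\\nmarker"])

# Remove real newline characters only.
no_newline = feature.str.replace("\n", "", regex=False)

# Remove the literal two-character sequence backslash + n only.
no_literal = feature.str.replace("\\n", "", regex=False)

print(no_newline.tolist())
print(no_literal.tolist())
```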

11. Filtering rows by string length

# Keep only the rows whose title is shorter than 300 characters
products_copy[products_copy.title.apply(lambda x: len(str(x))<300)]

12. Extracting only the rows that contain specific values

# Keep only the rows whose asin is in asin_list
products_reveiw_copy[products_reveiw_copy.asin.apply(lambda x: x in asin_list)]
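
The same membership test can be written with `Series.isin`, which avoids the per-row lambda. This is a sketch on a hypothetical review frame; the frame name, the `rating` column, and the list contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical review data.
products_review = pd.DataFrame({
    "asin": ["A1", "B2", "C3", "A1"],
    "rating": [5, 3, 4, 2],
})
asin_list = ["A1", "C3"]

# isin() builds the boolean mask in one vectorized call.
selected = products_review[products_review["asin"].isin(asin_list)]
print(selected)
```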

13. Merging on a specific column

review_review_count = pd.merge(review_count_per_asin_df,products_reviewText_df, how='inner', on='asin')

14. Creating a new column and filling it from a list at the same time

# Insert a column named keywords at position 4 (zero-based), filled with the list result
merged_reivew_q_copy.insert(4, "keywords", result, allow_duplicates=True)

15. Reading a DataFrame column as lists rather than strings

products.also_buy=products.also_buy.str[1:-1].str.split(',').tolist()

https://stackoverflow.com/questions/45758646/pandas-convert-string-into-list-of-strings
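
If the strings are valid Python list literals, `ast.literal_eval` from the standard library is an alternative to slicing and splitting. The slice-and-split approach leaves quotes and leading spaces inside each element and turns `"[]"` into `['']`, whereas `literal_eval` returns clean lists. The sample values below are hypothetical, and this sketch assumes no missing entries in the column.

```python
import ast
import pandas as pd

# Hypothetical column holding lists stored as text.
products = pd.DataFrame({"also_buy": ["['B001', 'B002']", "[]"]})

# Parse each string as a Python literal, producing real list objects.
products["also_buy"] = products["also_buy"].apply(ast.literal_eval)

print(products["also_buy"][0])
```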

16. Concatenating row-wise

pd.concat([df1, df2])
# or, on pandas versions before 2.0 only (DataFrame.append was removed in 2.0)
df1.append(df2)
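
Row-wise concatenation keeps each frame's original index, so labels can repeat. A minimal sketch with two hypothetical frames, showing `ignore_index=True` as one way to get a fresh 0..n-1 index:

```python
import pandas as pd

# Hypothetical frames, used only for illustration.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3]})

stacked = pd.concat([df1, df2])                         # index labels: 0, 1, 0
renumbered = pd.concat([df1, df2], ignore_index=True)   # index labels: 0, 1, 2

print(stacked.index.tolist())
print(renumbered.index.tolist())
```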

17. Sorting by a specific column

# Ascending
data.sort_values("reviewCount")
# Descending
data.sort_values("reviewCount",ascending=False)

18. Average length of a string column

df["mergedReview"].str.len().mean()

19. Adding list data as a new column

mergedReview_product.loc[:,'keywords'] = result

20. kw=[(a,1),(b,2)] → getting only the first element of each tuple in a list

[x[0] for x in kw]

21. Truncating the strings in a column

# Cut the description column down to its first 50 characters
description["description"].astype(str).apply(lambda x: x[:50])

22. Extracting only the year from a date column

# Take the four characters that precede the last character of each value
data["year"] = data["year"].apply(lambda x: x[-5:-1])
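
The slice `[-5:-1]` only works when the year sits in the four characters just before the final character. The original post does not show the data, so the sketch below assumes hypothetical values that end in `(YYYY)`, and uses the vectorized `.str` accessor instead of `apply`.

```python
import pandas as pd

# Hypothetical values ending in "(YYYY)"; the real column format is an assumption.
data = pd.DataFrame({"year": ["Released (1995)", "Released (2004)"]})

# .str[-5:-1] applies the same slice to every element without a lambda.
data["year"] = data["year"].str[-5:-1]

print(data["year"].tolist())
```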

23. Grouping rows and computing a statistic on a column

df_review.groupby("asin")["review_length"].mean().values
list(df_review.groupby("asin")["review_length"].mean().values)
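
Taking `.values` discards the group labels. When the `asin` for each mean is still needed, keeping the result as a Series or turning it back into a DataFrame may be more convenient. A sketch on hypothetical review data; the output column name `mean_review_length` is an assumption.

```python
import pandas as pd

# Hypothetical review lengths.
df_review = pd.DataFrame({
    "asin": ["A", "A", "B"],
    "review_length": [10, 20, 40],
})

# The result is a Series indexed by asin, so each mean keeps its label.
mean_len = df_review.groupby("asin")["review_length"].mean()

# reset_index(name=...) turns it into a two-column DataFrame.
table = mean_len.reset_index(name="mean_review_length")
print(table)
```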

24. Adding values to a new column based on another column

price_all['nlp'] = np.where(price_all.description.isnull(), 1,0)

This assigns 1 when the description column is NaN, and 0 otherwise.
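
As a quick check of that rule, here is a self-contained sketch on a hypothetical three-row frame. It also shows an equivalent spelling that casts the boolean mask to integers.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptions: one present, two missing.
price_all = pd.DataFrame({"description": ["soft toy", None, np.nan]})

# np.where(condition, value_if_true, value_if_false)
price_all["nlp"] = np.where(price_all["description"].isnull(), 1, 0)

# Equivalent: True/False cast to 1/0.
flag = price_all["description"].isnull().astype(int)

print(price_all)
```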