I'm a beginner at data analysis, and I'm trying something like a simple competition for the first time.
I didn't know where to start, so I have been feeling my way forward one step at a time, but I couldn't work out how to search for the answer to this.
As a result, I couldn't put the title into words well, and I think it is hard to understand. Sorry.
The competition is about predicting the price of a private lodging service. Information about the properties and price data are provided.
While looking through the data, I found a column ('amenities') about "facilities and amenities".
This column holds the data in a list format as shown below.
{TV, "Wireless Internet", Kitchen, "Free parking on premises", Washer, Dryer, "Smoke detector"}
{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door","24-hour check-in",Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
I find it hard to believe that the quality of these amenities has much effect on the price, but I would like to check just in case, so I would like to do the following.
I looked into whether this kind of operation is possible in Pandas, but I have no idea how to do it.
I think I could do 1. by editing the CSV directly, but I can't think of a way to do 2. mechanically.
I have two questions.
python pandas csv
This is probably Airbnb data.
Each item in this column is separated by , (a comma). Also, elements such as Wireless Internet that consist of two or more words and contain spaces are enclosed in " (double quotes).
Therefore, follow these steps:
1. Remove the enclosing { and }
2. Remove the " around multi-word elements
3. Split on , and convert the result into dummy variables
Steps 1. and 2. amount to removing three kinds of characters: {}"
In Pandas you can use the .str.replace() method for this.
To convert strings that contain a delimiter into dummy variables, as in step 3., Pandas provides the .str.get_dummies() method.
The following is an example run. The method chain uses a regular expression to delete the three characters {}" and then creates the dummy variables (data from this page was used).
In [1]: import pandas as pd
In [2]: df = pd.read_csv('lists.csv')
   ...: s = df['amenities']
   ...: s.head()
Out[2]:
0    {TV, "Cable TV", Internet, "Wireless Internet", "A...
1    {TV, Internet, "Wireless Internet", Kitchen, "Free...
2    {TV, "Cable TV", Internet, "Wireless Internet", "A...
3    {Internet, "Wireless Internet", Kitchen, "Indoor...
4    {TV, "Cable TV", Internet, "Wireless Internet", Ki...
Name: amenities, dtype: object
In [3]: s.str.replace(r'[{}"]', '', regex=True).str.get_dummies(',')
Out[3]:
      24-Hour Check-in  Air Conditioning  Breakfast  ...  Washer/Dryer  Wheelchair Accessible  Wireless Internet
   0                 0                 1          0  ...             0                      0                  1
   1                 0                 0          0  ...             0                      0                  1
   2                 0                 1          0  ...             0                      0                  1
   3                 0                 0          0  ...             0                      0                  1
   4                 0                 0          0  ...             0                      0                  1
 ...               ...               ...        ...  ...           ...                    ...                ...
3813                 0                 1          0  ...             0                      1                  1
3814                 1                 0          0  ...             0                      0                  1
3815                 0                 0          1  ...             0                      0                  1
3816                 0                 0          0  ...             0                      0                  1
3817                 0                 0          0  ...             0                      0                  1
[3818 rows x 41 columns]
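The same chain can be tried without the CSV file. The following is a minimal sketch using two made-up rows in the format shown in the question; the sample strings and the variable names are illustrative assumptions, and the extra whitespace cleanup is only there in case the raw data has spaces after the commas.

```python
import pandas as pd

# Two made-up rows in the same format as the 'amenities' column (assumed sample data).
s = pd.Series([
    '{TV,"Wireless Internet",Kitchen}',
    '{TV, "Cable TV", Washer}',
])

dummies = (
    s.str.replace(r'[{}"]', '', regex=True)     # drop the braces and double quotes
     .str.replace(r'\s*,\s*', ',', regex=True)  # remove any spaces around the commas
     .str.get_dummies(',')                      # one 0/1 column per amenity
)
print(dummies)
```

Without the whitespace step, an element such as " Cable TV" (with a leading space) would become its own column, separate from "Cable TV".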
First, convert a Series made up of strings like the following
>>> import pandas as pd
>>> amenity = pd.Series([
...     '{HOGE,"PIYO",HUGA}',
...     '{HOGE,"PIYO",HUGA}',
... ])
>>> amenity
0    {HOGE,"PIYO",HUGA}
1    {HOGE,"PIYO",HUGA}
dtype: object
into a Series of lists of strings:
>>> table = str.maketrans("", "", '{"}')  # delete the characters { " }
>>> amenity = amenity.str.translate(table).str.split(",")
>>> amenity
0    [HOGE, PIYO, HUGA]
1    [HOGE, PIYO, HUGA]
dtype: object
With that preparation done, I assume you want to turn a data structure like the following
amenity = [
    ["A", "B"],
    ["B", "C"],
    ["A"],
    ["B", "C"],
]
into a data structure like this:
   A  B  C
0  1  1  0
1  0  1  1
2  1  0  0
3  0  1  1
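One way to get from the list-of-lists form to a 0/1 table, sketched with pandas only. This assumes a pandas version that has Series.explode (0.25 or later) and is offered as an illustration, not as the only way to do it.

```python
import pandas as pd

amenity = [
    ["A", "B"],
    ["B", "C"],
    ["A"],
    ["B", "C"],
]

# explode() yields one row per item while keeping the original row number as the index.
exploded = pd.Series(amenity).explode()

# Dummy-encode each item, then collapse back to one row per original index.
table = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(table)
```

Using max() rather than sum() keeps the result strictly 0/1 even if an item is repeated within a row.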
In my case, I often do the following operations.
Python 3.8.11 (default, Aug 321, 06:49:12) MSC v.1916 64bit (AMD64):: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.3.1'
>>> df = pd.DataFrame({
...     "amenity": [
...         {"A", "B"},
...         {"B", "C"},
...         {"A"},
...         {"B", "C"},
...     ]
... })  # create sample data
>>> df
  amenity
0  {A, B}
1  {C, B}
2     {A}
3  {C, B}
>>> (df.amenity
...     .map(lambda s: {k: True for k in s})  # convert each set to a dict
...     .apply(pd.Series)                     # expand the keys into columns
...     .fillna(False))                       # fill missing values with False
       A      B      C
0   True   True  False
1  False   True   True
2   True  False  False
3  False   True   True
Note: The line breaks and comments were added for readability. The chain is wrapped in parentheses so that it can still be pasted as is.
scikit-learn also provides a transformer that does this.
I am not sure whether it is more common to do this with pandas or with scikit-learn, but I mention it for reference.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
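As a rough sketch of how that transformer could be applied to the same kind of sample data (the sample sets and variable names are assumptions for illustration, and scikit-learn must be installed):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data: one set of labels per row (assumed for illustration).
amenity = pd.Series([
    {"A", "B"},
    {"B", "C"},
    {"A"},
    {"B", "C"},
])

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(amenity)  # 2-D array of 0/1, one row per input set

# classes_ holds the sorted labels, in the same order as the array columns.
result = pd.DataFrame(encoded, columns=mlb.classes_, index=amenity.index)
print(result)
```

One practical difference is that the fitted object can be reused with transform() on other data, such as a test set, so that it gets the same columns.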
Searching for "pandas list column dummy" turned up the following:
https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies
There as well, the proposed solutions are to apply pd.Series and to use MultiLabelBinarizer.
You can use pandas.get_dummies.
import pandas as pd
amenities = pd.Series([
    '{TV, "Wireless Internet", Kitchen, "Free parking on premises", Washer, Dryer, "Smoke detector"}',
    '{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door","24-hour check-in",Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}',
])
category = pd.get_dummies(
    amenities.str.strip('{}"')               # drop the outer braces and any outer quotes
    .str.split(r'"?\s*,\s*"?', expand=True)  # split on commas, dropping adjacent quotes and spaces
    .stack()                                 # one row per (listing, amenity) pair
).groupby(level=0).sum()                     # collapse back to one row per listing
print(category.T.to_markdown())              # to_markdown() needs the tabulate package