Let's Predict Chicago Crime Rates (Using Facebook's Prophet Library)¶
STEP #0: PROBLEM STATEMENT¶
- The Chicago Crime dataset: 2001 ~ 2017.
- Data source: Kaggle https://www.kaggle.com/currie32/crimes-in-chicago
- Dataset contains the following columns:
- ID: Unique identifier for the record.
- Case Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
- Date: Date when the incident occurred.
- Block: The address where the incident occurred.
- IUCR: The Illinois Uniform Crime Reporting code.
- Primary Type: The primary description of the IUCR code.
- Description: The secondary description of the IUCR code, a subcategory of the primary description.
- Location Description: Description of the location where the incident occurred.
- Arrest: Indicates whether an arrest was made.
- Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car.
- District: Indicates the police district where the incident occurred.
- Ward: The ward (City Council district) where the incident occurred.
- Community Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
- FBI Code: Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).
- X Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Y Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
- Year: Year the incident occurred.
- Updated On: Date and time the record was last updated.
- Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.
The open-source Prophet library from Facebook¶
A forecasting library for analyzing seasonal time series data (Prophet fits an additive regression model, not a deep learning model).
Official Prophet page: https://research.fb.com/prophet-forecasting-at-scale/
https://facebook.github.io/prophet/docs/quick_start.html#python-api
Prophet comes pre-installed on Colab. If it is not installed in your environment, install it as follows (this notebook imports from the prophet package, which was formerly distributed as fbprophet).¶
pip install prophet
If the pip install above fails, install it via conda instead: conda install -c conda-forge prophet
STEP #1: IMPORTING DATA¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import seaborn as sns
from prophet import Prophet
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/ml_plus/data')
In [ ]:
# Chicago_Crimes_2005_to_2007.csv
# Chicago_Crimes_2008_to_2011.csv
# Chicago_Crimes_2012_to_2017.csv: read each of these files,
# adding the parameter error_bad_lines=False to every read_csv call.
In [5]:
# A few malformed rows prevent the whole file from loading (a ParserError is raised),
# so error_bad_lines=False is used to skip the bad rows and load the rest.
chicago_df_1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv')
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-5-ffad2a3033fe> in <module>
----> 1 chicago_df_1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv')
...
ParserError: Error tokenizing data. C error: Expected 23 fields in line 533719, saw 24
In [6]:
chicago_df_1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)
chicago_df_2 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)
chicago_df_3 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version.
  exec(code_obj, self.user_global_ns, self.user_ns)
b'Skipping line 533719: expected 23 fields, saw 24\n'
b'Skipping line 1149094: expected 23 fields, saw 41\n'
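As the FutureWarning above notes, error_bad_lines is deprecated in newer pandas releases. A minimal sketch of the same reads using the replacement on_bad_lines parameter (assuming pandas 1.3 or later):

# equivalent reads on pandas >= 1.3, where error_bad_lines has been deprecated
chicago_df_1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', on_bad_lines='skip')
chicago_df_2 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', on_bad_lines='skip')
chicago_df_3 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', on_bad_lines='skip')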
In [ ]:
# Inspect the data and fix anything odd: each file carries a stray 'Unnamed: 0' index column, so drop it.
chicago_df_1.drop('Unnamed: 0',axis=1, inplace=True)
In [15]:
chicago_df_2.drop('Unnamed: 0',axis=1, inplace=True)
chicago_df_3.drop('Unnamed: 0',axis=1, inplace=True)
In [19]:
chicago_df_1.head(1)
Out[19]:
ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4673626 | HM274058 | 04/02/2006 01:00:00 PM | 055XX N MANGO AVE | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | False | ... | 45.0 | 11.0 | 26 | 1136872.0 | 1936499.0 | 2006 | 04/15/2016 08:55:02 AM | 41.981913 | -87.771996 | (41.981912692, -87.771996382) |
1 rows × 22 columns
In [18]:
# Combine the three dataframes above into one.
# Since the column names are identical, use concat.
chicago_df = pd.concat( [chicago_df_1,chicago_df_2,chicago_df_3])
In [24]:
chicago_df.shape
Out[24]:
(6017767, 22)
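Note that concat keeps each file's original row labels, which is why the combined frame later shows an Int64Index running only up to 1456713, with duplicates. That does no harm here (the index is replaced by Date further down), but if a clean 0..N-1 index were wanted, a small variation would be:

# same concatenation, but with a fresh 0..N-1 index
chicago_df = pd.concat([chicago_df_1, chicago_df_2, chicago_df_3], ignore_index=True)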
In [ ]:
# Goal: predict the number of crimes over time.
STEP #2: EXPLORING THE DATASET¶
In [25]:
# Let's view the head of the training dataset
chicago_df.head()
Out[25]:
ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4673626 | HM274058 | 04/02/2006 01:00:00 PM | 055XX N MANGO AVE | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | False | ... | 45.0 | 11.0 | 26 | 1136872.0 | 1936499.0 | 2006 | 04/15/2016 08:55:02 AM | 41.981913 | -87.771996 | (41.981912692, -87.771996382) |
1 | 4673627 | HM202199 | 02/26/2006 01:40:48 PM | 065XX S RHODES AVE | 2017 | NARCOTICS | MANU/DELIVER:CRACK | SIDEWALK | True | False | ... | 20.0 | 42.0 | 18 | 1181027.0 | 1861693.0 | 2006 | 04/15/2016 08:55:02 AM | 41.775733 | -87.611920 | (41.775732538, -87.611919814) |
2 | 4673628 | HM113861 | 01/08/2006 11:16:00 PM | 013XX E 69TH ST | 051A | ASSAULT | AGGRAVATED: HANDGUN | OTHER | False | False | ... | 5.0 | 69.0 | 04A | 1186023.0 | 1859609.0 | 2006 | 04/15/2016 08:55:02 AM | 41.769897 | -87.593671 | (41.769897392, -87.593670899) |
3 | 4673629 | HM274049 | 04/05/2006 06:45:00 PM | 061XX W NEWPORT AVE | 0460 | BATTERY | SIMPLE | RESIDENCE | False | False | ... | 38.0 | 17.0 | 08B | 1134772.0 | 1922299.0 | 2006 | 04/15/2016 08:55:02 AM | 41.942984 | -87.780057 | (41.942984005, -87.780056951) |
4 | 4673630 | HM187120 | 02/17/2006 09:03:14 PM | 037XX W 60TH ST | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | ALLEY | True | False | ... | 13.0 | 65.0 | 18 | 1152412.0 | 1864560.0 | 2006 | 04/15/2016 08:55:02 AM | 41.784211 | -87.716745 | (41.784210853, -87.71674491) |
5 rows × 22 columns
In [26]:
# Let's view the last elements in the training dataset
chicago_df.tail()
Out[26]:
ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1456709 | 10508679 | HZ250507 | 05/03/2016 11:33:00 PM | 026XX W 23RD PL | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | ... | 28.0 | 30.0 | 08B | 1159105.0 | 1888300.0 | 2016 | 05/10/2016 03:56:50 PM | 41.849222 | -87.691556 | (41.849222028, -87.69155551) |
1456710 | 10508680 | HZ250491 | 05/03/2016 11:30:00 PM | 073XX S HARVARD AVE | 1310 | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | True | True | ... | 17.0 | 69.0 | 14 | 1175230.0 | 1856183.0 | 2016 | 05/10/2016 03:56:50 PM | 41.760744 | -87.633335 | (41.760743949, -87.63333531) |
1456711 | 10508681 | HZ250479 | 05/03/2016 12:15:00 AM | 024XX W 63RD ST | 041A | BATTERY | AGGRAVATED: HANDGUN | SIDEWALK | False | False | ... | 15.0 | 66.0 | 04B | 1161027.0 | 1862810.0 | 2016 | 05/10/2016 03:56:50 PM | 41.779235 | -87.685207 | (41.779234743, -87.685207125) |
1456712 | 10508690 | HZ250370 | 05/03/2016 09:07:00 PM | 082XX S EXCHANGE AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | SIDEWALK | False | True | ... | 7.0 | 46.0 | 08B | 1197261.0 | 1850727.0 | 2016 | 05/10/2016 03:56:50 PM | 41.745252 | -87.552773 | (41.745251975, -87.552773464) |
1456713 | 10508692 | HZ250517 | 05/03/2016 11:38:00 PM | 001XX E 75TH ST | 5007 | OTHER OFFENSE | OTHER WEAPONS VIOLATION | PARKING LOT/GARAGE(NON.RESID.) | True | False | ... | 6.0 | 69.0 | 26 | 1178696.0 | 1855324.0 | 2016 | 05/10/2016 03:56:50 PM | 41.758309 | -87.620658 | (41.75830866, -87.620658418) |
5 rows × 22 columns
Check how many values are missing.¶
In [27]:
chicago_df.isna().sum()
Out[27]:
ID                          0
Case Number                 7
Date                        0
Block                       0
IUCR                        0
Primary Type                0
Description                 0
Location Description     1974
Arrest                      0
Domestic                    0
Beat                        0
District                   89
Ward                       92
Community Area           1844
FBI Code                    0
X Coordinate            74882
Y Coordinate            74882
Year                        0
Updated On                  0
Latitude                74882
Longitude               74882
Location                74882
dtype: int64
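Most of the missing values sit in columns that are dropped in the next step, so nothing needs to be imputed. If, say, the rows without a Location Description had to be excluded as well, one optional sketch (not done in this notebook) would be:

# optional: drop rows whose Location Description is missing (not applied here)
chicago_df = chicago_df.dropna(subset=['Location Description'])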
Drop the following columns.¶
'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate', 'Updated On', 'Year', 'FBI Code', 'Beat', 'Ward', 'Community Area', 'Location', 'District', 'Latitude', 'Longitude'
In [28]:
cols = ['Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate', 'Updated On', 'Year', 'FBI Code', 'Beat', 'Ward', 'Community Area', 'Location', 'District', 'Latitude', 'Longitude']
In [30]:
chicago_df.drop(cols, axis=1,inplace=True)
Looking at the Date column, it holds dates as text. Convert it to a datetime type that Python understands and store it back in the Date column.¶
In [34]:
chicago_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6017767 entries, 0 to 1456713
Data columns (total 8 columns):
 #   Column                Dtype
---  ------                -----
 0   ID                    int64
 1   Date                  object
 2   Block                 object
 3   Primary Type          object
 4   Description           object
 5   Location Description  object
 6   Arrest                bool
 7   Domestic              bool
dtypes: bool(2), int64(1), object(5)
memory usage: 332.9+ MB
In [35]:
chicago_df['Date'] # still an object (string) dtype, not a datetime
Out[35]:
0          04/02/2006 01:00:00 PM
1          02/26/2006 01:40:48 PM
2          01/08/2006 11:16:00 PM
3          04/05/2006 06:45:00 PM
4          02/17/2006 09:03:14 PM
                    ...
1456709    05/03/2016 11:33:00 PM
1456710    05/03/2016 11:30:00 PM
1456711    05/03/2016 12:15:00 AM
1456712    05/03/2016 09:07:00 PM
1456713    05/03/2016 11:38:00 PM
Name: Date, Length: 6017767, dtype: object
In [37]:
chicago_df['Date'] = pd.to_datetime( chicago_df['Date'], format= '%m/%d/%Y %I:%M:%S %p' ) # the cleanest way to convert to datetime; the strings are not ISO format, so the format must be given explicitly
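The format string mirrors the raw text exactly: %m/%d/%Y for the date part, %I for the 12-hour clock, %M:%S for minutes and seconds, and %p for AM/PM. A quick sanity check on a single value (the sample string is taken from the first row shown earlier):

# single-value check of the format string
pd.to_datetime('04/02/2006 01:00:00 PM', format='%m/%d/%Y %I:%M:%S %p')
# Timestamp('2006-04-02 13:00:00')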
In [40]:
chicago_df.head()
Out[40]:
ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | |
---|---|---|---|---|---|---|---|---|
0 | 4673626 | 2006-04-02 13:00:00 | 055XX N MANGO AVE | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | False |
1 | 4673627 | 2006-02-26 13:40:48 | 065XX S RHODES AVE | NARCOTICS | MANU/DELIVER:CRACK | SIDEWALK | True | False |
2 | 4673628 | 2006-01-08 23:16:00 | 013XX E 69TH ST | ASSAULT | AGGRAVATED: HANDGUN | OTHER | False | False |
3 | 4673629 | 2006-04-05 18:45:00 | 061XX W NEWPORT AVE | BATTERY | SIMPLE | RESIDENCE | False | False |
4 | 4673630 | 2006-02-17 21:03:14 | 037XX W 60TH ST | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | ALLEY | True | False |
In [41]:
chicago_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6017767 entries, 0 to 1456713
Data columns (total 8 columns):
 #   Column                Dtype
---  ------                -----
 0   ID                    int64
 1   Date                  datetime64[ns]
 2   Block                 object
 3   Primary Type          object
 4   Description           object
 5   Location Description  object
 6   Arrest                bool
 7   Domestic              bool
dtypes: bool(2), datetime64[ns](1), int64(1), object(4)
memory usage: 332.9+ MB
In [ ]:
# The time each crime occurred is in the Date column.
# Extract the day of the week from that information
# and show the number of crimes for each weekday.
In [43]:
chicago_df['weekday']= chicago_df['Date'].dt.weekday
In [44]:
chicago_df['weekday'].value_counts()
Out[44]:
4    910373
2    870841
1    865340
3    860425
5    858153
0    843525
6    809110
Name: weekday, dtype: int64
In [46]:
chicago_df['weekday'] = chicago_df['Date'].dt.strftime('%A')
In [47]:
chicago_df['weekday'].value_counts()
Out[47]:
Friday       910373
Wednesday    870841
Tuesday      865340
Thursday     860425
Saturday     858153
Monday       843525
Sunday       809110
Name: weekday, dtype: int64
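value_counts sorts by frequency, not by calendar order. If the weekdays should be shown Monday through Sunday instead, one option is to reindex the counts before plotting; a small sketch (day_order is just the standard weekday names produced by %A):

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
chicago_df['weekday'].value_counts().reindex(day_order).plot(kind='bar')
plt.show()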
Make the Date column the index.¶
In [ ]:
# groupby on its own cannot simply be told to bucket datetime data
# by year, month, day, hour, minute, second, and so on
# (see the pd.Grouper aside just below for the one exception).
# So we first make the Date column the index.
# That makes the resample function available,
# and resample can group and aggregate the data by year, by month, and so on.
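As an aside, pandas does offer pd.Grouper, which lets groupby bucket a datetime column without touching the index; setting the index and using resample, as done below, is simply the more convenient route here. A minimal sketch of that alternative (monthly counts, assuming Date is still an ordinary column):

# alternative to resample: group a datetime column directly with pd.Grouper
chicago_df.groupby(pd.Grouper(key='Date', freq='M')).size()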
In [48]:
chicago_df.head()
Out[48]:
ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | weekday | |
---|---|---|---|---|---|---|---|---|---|
0 | 4673626 | 2006-04-02 13:00:00 | 055XX N MANGO AVE | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | False | Sunday |
1 | 4673627 | 2006-02-26 13:40:48 | 065XX S RHODES AVE | NARCOTICS | MANU/DELIVER:CRACK | SIDEWALK | True | False | Sunday |
2 | 4673628 | 2006-01-08 23:16:00 | 013XX E 69TH ST | ASSAULT | AGGRAVATED: HANDGUN | OTHER | False | False | Sunday |
3 | 4673629 | 2006-04-05 18:45:00 | 061XX W NEWPORT AVE | BATTERY | SIMPLE | RESIDENCE | False | False | Wednesday |
4 | 4673630 | 2006-02-17 21:03:14 | 037XX W 60TH ST | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | ALLEY | True | False | Friday |
In [49]:
chicago_df.index = chicago_df['Date']
In [50]:
chicago_df.head()
Out[50]:
ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | weekday | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
2006-04-02 13:00:00 | 4673626 | 2006-04-02 13:00:00 | 055XX N MANGO AVE | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | False | Sunday |
2006-02-26 13:40:48 | 4673627 | 2006-02-26 13:40:48 | 065XX S RHODES AVE | NARCOTICS | MANU/DELIVER:CRACK | SIDEWALK | True | False | Sunday |
2006-01-08 23:16:00 | 4673628 | 2006-01-08 23:16:00 | 013XX E 69TH ST | ASSAULT | AGGRAVATED: HANDGUN | OTHER | False | False | Sunday |
2006-04-05 18:45:00 | 4673629 | 2006-04-05 18:45:00 | 061XX W NEWPORT AVE | BATTERY | SIMPLE | RESIDENCE | False | False | Wednesday |
2006-02-17 21:03:14 | 4673630 | 2006-02-17 21:03:14 | 037XX W 60TH ST | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | ALLEY | True | False | Friday |
Count the crimes by type and show them in descending order, most frequent first.¶
In [51]:
chicago_df['Primary Type'].value_counts()
Out[51]:
THEFT                                1245111
BATTERY                              1079178
CRIMINAL DAMAGE                       702702
NARCOTICS                             674831
BURGLARY                              369056
OTHER OFFENSE                         368169
ASSAULT                               360244
MOTOR VEHICLE THEFT                   271624
ROBBERY                               229467
DECEPTIVE PRACTICE                    225180
CRIMINAL TRESPASS                     171596
PROSTITUTION                           60735
WEAPONS VIOLATION                      60335
PUBLIC PEACE VIOLATION                 48403
OFFENSE INVOLVING CHILDREN             40260
CRIM SEXUAL ASSAULT                    22789
SEX OFFENSE                            20172
GAMBLING                               14755
INTERFERENCE WITH PUBLIC OFFICER       14009
LIQUOR LAW VIOLATION                   12129
ARSON                                   9269
HOMICIDE                                5879
KIDNAPPING                              4734
INTIMIDATION                            3324
STALKING                                2866
OBSCENITY                                422
PUBLIC INDECENCY                         134
OTHER NARCOTIC VIOLATION                 122
NON-CRIMINAL                              96
CONCEALED CARRY LICENSE VIOLATION         90
NON - CRIMINAL                            38
HUMAN TRAFFICKING                         28
RITUALISM                                 16
NON-CRIMINAL (SUBJECT SPECIFIED)           4
Name: Primary Type, dtype: int64
Show only the top 15.¶
In [52]:
chicago_df['Primary Type'].value_counts().head(15)
Out[52]:
THEFT                         1245111
BATTERY                       1079178
CRIMINAL DAMAGE                702702
NARCOTICS                      674831
BURGLARY                       369056
OTHER OFFENSE                  368169
ASSAULT                        360244
MOTOR VEHICLE THEFT            271624
ROBBERY                        229467
DECEPTIVE PRACTICE             225180
CRIMINAL TRESPASS              171596
PROSTITUTION                    60735
WEAPONS VIOLATION               60335
PUBLIC PEACE VIOLATION          48403
OFFENSE INVOLVING CHILDREN      40260
Name: Primary Type, dtype: int64
Visualize the counts of the top 15 crime types (Primary Type).¶
In [53]:
top_15 = chicago_df['Primary Type'].value_counts().head(15)
In [54]:
top_15.index
Out[54]:
Index(['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'BURGLARY', 'OTHER OFFENSE', 'ASSAULT', 'MOTOR VEHICLE THEFT', 'ROBBERY', 'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS', 'PROSTITUTION', 'WEAPONS VIOLATION', 'PUBLIC PEACE VIOLATION', 'OFFENSE INVOLVING CHILDREN'], dtype='object')
In [ ]:
import seaborn as sb  # seaborn was already imported as sns above; this cell re-imports it under the alias sb used below
In [124]:
sb.countplot(data = chicago_df , y= 'Primary Type', order= top_15.index )
plt.show()
Visualize where crime happens most often, by crime location (Location Description).¶
In [133]:
loc_top_15 = chicago_df['Location Description'].value_counts().head(15)
In [134]:
loc_top_15
Out[134]:
STREET                            1517724
RESIDENCE                          991977
SIDEWALK                           674793
APARTMENT                          668298
OTHER                              216154
PARKING LOT/GARAGE(NON.RESID.)     166331
ALLEY                              137094
SCHOOL, PUBLIC, BUILDING           128852
RESIDENCE-GARAGE                   119619
VEHICLE NON-COMMERCIAL             107554
RESIDENCE PORCH/HALLWAY            103649
SMALL RETAIL STORE                 103462
RESTAURANT                          89154
DEPARTMENT STORE                    72763
RESIDENTIAL YARD (FRONT/BACK)       72504
Name: Location Description, dtype: int64
In [135]:
plt.figure(figsize=(6,15))
sb.countplot(data = chicago_df , y = 'Location Description', order= loc_top_15.index )
plt.show()
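countplot walks through all six million rows again just to draw the bars. Since loc_top_15 already holds the counts, an equivalent and much lighter plot (a sketch using pandas' built-in plotting) is:

# plot the precomputed counts directly instead of re-counting with countplot
loc_top_15.sort_values().plot(kind='barh', figsize=(6, 8))
plt.show()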
Let's analyze the data by time period¶
In [60]:
# Resampling with 'YS' groups the data by year (year start). After resampling by year, count how many crime records each year contains.
df_year = chicago_df.resample('YS').size()
In [61]:
# Visualize the yearly counts with a plot so the crime numbers are easy to see.
plt.plot(df_year)
plt.show()
In [62]:
df_year.plot()
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0d989acfd0>
In [64]:
chicago_df.sort_index().tail()
Out[64]:
ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | weekday | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
2017-01-18 23:30:00 | 10820633 | 2017-01-18 23:30:00 | 116XX S LOWE AVE | BATTERY | DOMESTIC BATTERY SIMPLE | SIDEWALK | False | True | Wednesday |
2017-01-18 23:35:00 | 10820662 | 2017-01-18 23:35:00 | 028XX W SHAKESPEARE AVE | ROBBERY | VEHICULAR HIJACKING | ALLEY | False | False | Wednesday |
2017-01-18 23:40:00 | 10821699 | 2017-01-18 23:40:00 | 010XX W WILSON AVE | ROBBERY | ARMED: HANDGUN | STREET | False | False | Wednesday |
2017-01-18 23:45:00 | 10820646 | 2017-01-18 23:45:00 | 004XX N HOMAN AVE | WEAPONS VIOLATION | UNLAWFUL POSS OF HANDGUN | SIDEWALK | True | False | Wednesday |
2017-01-18 23:49:00 | 10820691 | 2017-01-18 23:49:00 | 034XX W DICKENS AVE | ROBBERY | ARMED: HANDGUN | ALLEY | False | False | Wednesday |
In [110]:
# Check the number of crimes per month.
df_month = chicago_df.resample('M').size()
In [67]:
# Visualize the monthly crime counts with a plot as well.
df_month.plot()
plt.show()
In [69]:
# Check the quarterly crime counts as well.
df_q = chicago_df.resample('Q').size()
In [70]:
# Visualize the quarterly crime counts as well.
df_q.plot()
plt.show()
In [71]:
# Daily counts
df_day = chicago_df.resample('D').size()
In [72]:
df_day.head()
Out[72]:
Date
2005-01-01    2038
2005-01-02    1075
2005-01-03    1091
2005-01-04    1123
2005-01-05     945
Freq: D, dtype: int64
STEP #3: DATA PREPARATION¶
Resample the data to daily counts to build a dataframe, and reset the index.¶
In [73]:
chicago_prophet = chicago_df.resample('D').size()
In [74]:
chicago_prophet.head()
Out[74]:
Date
2005-01-01    2038
2005-01-02    1075
2005-01-03    1091
2005-01-04    1123
2005-01-05     945
Freq: D, dtype: int64
To use the Prophet library, the date column must be named 'ds' and the value being predicted must be named 'y' (this is required).¶
In [76]:
chicago_prophet = chicago_prophet.reset_index() # reset_index turns the Series into a DataFrame
In [78]:
chicago_prophet.columns = ['ds','y']
In [80]:
chicago_prophet.head()
Out[80]:
ds | y | |
---|---|---|
0 | 2005-01-01 | 2038 |
1 | 2005-01-02 | 1075 |
2 | 2005-01-03 | 1091 |
3 | 2005-01-04 | 1123 |
4 | 2005-01-05 | 945 |
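The same ds/y preparation can also be written without overwriting .columns wholesale, which is safer if the column order ever changes. A one-line sketch (after reset_index the count column is literally named 0):

chicago_prophet = chicago_df.resample('D').size().reset_index().rename(columns={'Date': 'ds', 0: 'y'})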
STEP #4: FORECASTING WITH PROPHET¶
In [82]:
prophet = Prophet()
In [83]:
prophet.fit(chicago_prophet)
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
06:08:19 - cmdstanpy - INFO - Chain [1] start processing
06:08:23 - cmdstanpy - INFO - Chain [1] done processing
Out[83]:
<prophet.forecaster.Prophet at 0x7f0d988bf6d0>
In [84]:
# Forecast two years (730 days) into the future.
future = prophet.make_future_dataframe( periods= 730, freq = 'D' )
In [85]:
forecast = prophet.predict(future)
In [86]:
forecast
Out[86]:
ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | weekly | weekly_lower | weekly_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-01 | 1225.675125 | 802.182709 | 1278.750072 | 1225.675125 | 1225.675125 | -187.016884 | -187.016884 | -187.016884 | -3.020912 | -3.020912 | -3.020912 | -183.995972 | -183.995972 | -183.995972 | 0.0 | 0.0 | 0.0 | 1038.658241 |
1 | 2005-01-02 | 1225.631512 | 709.542207 | 1200.654682 | 1225.631512 | 1225.631512 | -257.693454 | -257.693454 | -257.693454 | -80.891351 | -80.891351 | -80.891351 | -176.802103 | -176.802103 | -176.802103 | 0.0 | 0.0 | 0.0 | 967.938058 |
2 | 2005-01-03 | 1225.587899 | 817.962732 | 1278.353915 | 1225.587899 | 1225.587899 | -195.782863 | -195.782863 | -195.782863 | -26.076470 | -26.076470 | -26.076470 | -169.706393 | -169.706393 | -169.706393 | 0.0 | 0.0 | 0.0 | 1029.805035 |
3 | 2005-01-04 | 1225.544285 | 825.196334 | 1318.703652 | 1225.544285 | 1225.544285 | -154.108636 | -154.108636 | -154.108636 | 8.700692 | 8.700692 | 8.700692 | -162.809328 | -162.809328 | -162.809328 | 0.0 | 0.0 | 0.0 | 1071.435650 |
4 | 2005-01-05 | 1225.500672 | 845.041647 | 1346.002793 | 1225.500672 | 1225.500672 | -138.658170 | -138.658170 | -138.658170 | 17.552502 | 17.552502 | 17.552502 | -156.210672 | -156.210672 | -156.210672 | 0.0 | 0.0 | 0.0 | 1086.842502 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5126 | 2019-01-14 | 704.530323 | -1617.325305 | 2933.159993 | -1488.037608 | 3163.521827 | -150.157297 | -150.157297 | -150.157297 | -26.076470 | -26.076470 | -26.076470 | -124.080826 | -124.080826 | -124.080826 | 0.0 | 0.0 | 0.0 | 554.373027 |
5127 | 2019-01-15 | 704.507683 | -1551.686328 | 2961.416689 | -1496.644038 | 3171.441095 | -115.366493 | -115.366493 | -115.366493 | 8.700692 | 8.700692 | 8.700692 | -124.067186 | -124.067186 | -124.067186 | 0.0 | 0.0 | 0.0 | 589.141189 |
5128 | 2019-01-16 | 704.485042 | -1600.828384 | 3073.532218 | -1505.092586 | 3175.171670 | -107.366825 | -107.366825 | -107.366825 | 17.552502 | 17.552502 | 17.552502 | -124.919327 | -124.919327 | -124.919327 | 0.0 | 0.0 | 0.0 | 597.118217 |
5129 | 2019-01-17 | 704.462402 | -1587.485477 | 3000.880004 | -1513.380447 | 3178.859419 | -124.565981 | -124.565981 | -124.565981 | 2.050915 | 2.050915 | 2.050915 | -126.616896 | -126.616896 | -126.616896 | 0.0 | 0.0 | 0.0 | 579.896421 |
5130 | 2019-01-18 | 704.439761 | -1517.262698 | 3134.508328 | -1521.668307 | 3182.547169 | -47.440433 | -47.440433 | -47.440433 | 81.684624 | 81.684624 | 81.684624 | -129.125056 | -129.125056 | -129.125056 | 0.0 | 0.0 | 0.0 | 656.999328 |
5131 rows × 19 columns
In [88]:
prophet.plot(forecast)
plt.savefig('chart11.jpg')
In [89]:
prophet.plot_components(forecast) # weekly == pattern by day of the week // yearly == pattern across the months of the year
plt.savefig('chart12.jpg')
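Prophet also ships with a diagnostics module for checking how accurate forecasts like this actually are, using rolling-origin cross-validation. A minimal sketch follows; the initial/period/horizon window sizes are illustrative choices, not values used in this notebook, and the run is slow on twelve years of daily data.

from prophet.diagnostics import cross_validation, performance_metrics

# rolling-origin cross-validation: fit on 'initial', move the cutoff by 'period', forecast 'horizon' ahead
df_cv = cross_validation(prophet, initial='3650 days', period='180 days', horizon='365 days')
performance_metrics(df_cv).head()   # MSE, RMSE, MAE, MAPE, coverage per horizon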
In [ ]:
# Forecast using monthly data, 36 months into the future.
In [111]:
df_month
Out[111]:
Date
2005-01-31    33983
2005-02-28    32042
2005-03-31    36970
2005-04-30    38963
2005-05-31    40572
              ...
2016-09-30    23235
2016-10-31    23314
2016-11-30    21140
2016-12-31    19580
2017-01-31    11357
Freq: M, Length: 145, dtype: int64
In [112]:
df_month = df_month.reset_index()
In [113]:
df_month
Out[113]:
Date | 0 | |
---|---|---|
0 | 2005-01-31 | 33983 |
1 | 2005-02-28 | 32042 |
2 | 2005-03-31 | 36970 |
3 | 2005-04-30 | 38963 |
4 | 2005-05-31 | 40572 |
... | ... | ... |
140 | 2016-09-30 | 23235 |
141 | 2016-10-31 | 23314 |
142 | 2016-11-30 | 21140 |
143 | 2016-12-31 | 19580 |
144 | 2017-01-31 | 11357 |
145 rows × 2 columns
In [114]:
df_month.columns = ['ds','y']
In [115]:
df_month
Out[115]:
ds | y | |
---|---|---|
0 | 2005-01-31 | 33983 |
1 | 2005-02-28 | 32042 |
2 | 2005-03-31 | 36970 |
3 | 2005-04-30 | 38963 |
4 | 2005-05-31 | 40572 |
... | ... | ... |
140 | 2016-09-30 | 23235 |
141 | 2016-10-31 | 23314 |
142 | 2016-11-30 | 21140 |
143 | 2016-12-31 | 19580 |
144 | 2017-01-31 | 11357 |
145 rows × 2 columns
In [116]:
prophet2 = Prophet()
In [117]:
prophet2.fit(df_month)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
06:42:55 - cmdstanpy - INFO - Chain [1] start processing
06:42:55 - cmdstanpy - INFO - Chain [1] done processing
Out[117]:
<prophet.forecaster.Prophet at 0x7f0d97637ee0>
In [118]:
future2 = prophet2.make_future_dataframe( periods = 36, freq = 'M')
In [119]:
forecast2 = prophet2.predict(future2)
In [120]:
forecast2
Out[120]:
ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-31 | 60454.915750 | 38461.924820 | 72269.986261 | 60454.915750 | 60454.915750 | -4762.551264 | -4762.551264 | -4762.551264 | -4762.551264 | -4762.551264 | -4762.551264 | 0.0 | 0.0 | 0.0 | 55692.364486 |
1 | 2005-02-28 | 60322.513711 | 34534.175412 | 67602.103091 | 60322.513711 | 60322.513711 | -9500.714243 | -9500.714243 | -9500.714243 | -9500.714243 | -9500.714243 | -9500.714243 | 0.0 | 0.0 | 0.0 | 50821.799468 |
2 | 2005-03-31 | 60175.925739 | 41305.352421 | 74941.994143 | 60175.925739 | 60175.925739 | -1224.220149 | -1224.220149 | -1224.220149 | -1224.220149 | -1224.220149 | -1224.220149 | 0.0 | 0.0 | 0.0 | 58951.705590 |
3 | 2005-04-30 | 60034.066412 | 44155.843301 | 78561.681987 | 60034.066412 | 60034.066412 | 1182.849883 | 1182.849883 | 1182.849883 | 1182.849883 | 1182.849883 | 1182.849883 | 0.0 | 0.0 | 0.0 | 61216.916294 |
4 | 2005-05-31 | 59887.478440 | 47548.508428 | 81642.559366 | 59887.478440 | 59887.478440 | 5498.337383 | 5498.337383 | 5498.337383 | 5498.337383 | 5498.337383 | 5498.337383 | 0.0 | 0.0 | 0.0 | 65385.815823 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
176 | 2019-09-30 | 2120.711359 | -12389.989843 | 21614.497157 | 1428.911452 | 2874.558107 | 1911.286936 | 1911.286936 | 1911.286936 | 1911.286936 | 1911.286936 | 1911.286936 | 0.0 | 0.0 | 0.0 | 4031.998295 |
177 | 2019-10-31 | 1715.644536 | -12040.568333 | 20507.855457 | 985.105538 | 2503.338904 | 2480.326606 | 2480.326606 | 2480.326606 | 2480.326606 | 2480.326606 | 2480.326606 | 0.0 | 0.0 | 0.0 | 4195.971143 |
178 | 2019-11-30 | 1323.644386 | -17133.224048 | 16179.640163 | 549.995789 | 2137.253867 | -2026.637311 | -2026.637311 | -2026.637311 | -2026.637311 | -2026.637311 | -2026.637311 | 0.0 | 0.0 | 0.0 | -702.992925 |
179 | 2019-12-31 | 918.577564 | -22752.554128 | 11660.499006 | 102.821851 | 1762.425870 | -6033.991502 | -6033.991502 | -6033.991502 | -6033.991502 | -6033.991502 | -6033.991502 | 0.0 | 0.0 | 0.0 | -5115.413939 |
180 | 2020-01-31 | 513.510741 | -19692.675197 | 13075.314357 | -348.573294 | 1403.626213 | -4793.512159 | -4793.512159 | -4793.512159 | -4793.512159 | -4793.512159 | -4793.512159 | 0.0 | 0.0 | 0.0 | -4280.001418 |
181 rows × 16 columns
In [121]:
prophet2.plot(forecast2)
plt.savefig('chart13.jpg')
In [122]:
prophet2.plot_components(forecast2)
plt.savefig('chart13.png')
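To read the predicted monthly counts as numbers rather than from the charts, the relevant columns of forecast2 can be pulled out directly; a small sketch (the last 36 rows are exactly the future months created above):

# predicted monthly crime counts with their uncertainty interval, future months only
forecast2[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(36)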