Twitter News Article Crawling and Visualization

2019-01-28

  • Web crawling, keyword tokenizing, and visualization of news tweets from major English-language news outlets (bbc, nytimes, etc.) on Twitter
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import time  # needed for time.sleep() in the scroll loop
import requests
import pandas as pd
from selenium import webdriver
from fake_useragent import UserAgent
import nltk
from wordcloud import WordCloud
from nltk import FreqDist
from nltk.tag import pos_tag
from nltk import Text
from nltk.tokenize import RegexpTokenizer
result = []
df_bbc = []
df_newyork = []
df_fox = []
df_abc = []

url_list = ['https://twitter.com/bbcworld',
            'https://twitter.com/nytimes',
            'https://twitter.com/CNN',
            'https://twitter.com/FoxNews',
            'https://twitter.com/ABC'
           ]

for member in url_list:

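    # Use a randomized Chrome user-agent string for this session.
    options = webdriver.ChromeOptions()
    options.add_argument("user-agent={}".format(UserAgent().chrome))

    # Pass the options to the driver; otherwise the user-agent is never applied.
    driver = webdriver.Chrome(options=options)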
    driver.get(member)

    for count in range(10):
        
        time.sleep(1)

        script = "window.scrollTo(0, 100000);"
        driver.execute_script(script)

    items = driver.find_elements_by_css_selector("#stream-items-id .content")
    print("{} : ".format(member[20:]), len(items))

    for idx, item in enumerate(items):

        # Each field falls back to a default when its element is missing.
        # "except Exception" avoids swallowing KeyboardInterrupt.
        try:
            time_before = item.find_element_by_css_selector("div:nth-child(1) > small > a > span").text
        except Exception:
            time_before = None
        try:
            news_comment = item.find_element_by_css_selector(".js-tweet-text-container p").text
        except Exception:
            news_comment = None
        try:
            news_url = item.find_element_by_css_selector("div:nth-child(1) > small > a").get_attribute("href")
        except Exception:
            news_url = None
        try:
            reaction = item.find_element_by_css_selector("div:nth-child(4) > div:nth-child(2) > div:nth-child(1) .ProfileTweet-actionCountForPresentation").text
        except Exception:
            reaction = 0
        try:
            retweet = item.find_element_by_css_selector("div:nth-child(4) > div:nth-child(2) > div:nth-child(2) .ProfileTweet-actionCountForPresentation").text
        except Exception:
            retweet = 0
        try:
            like = item.find_element_by_css_selector("div:nth-child(4) > div:nth-child(2) > div:nth-child(3) .ProfileTweet-actionCountForPresentation").text
        except Exception:
            like = 0

        
        data = {
            "time(before)" : time_before,
            "news_comment" : news_comment,
            "news_url" : news_url,
            "reaction" : reaction,
            "retweet" : retweet,
            "like" : like,
            "brand" : member[20:]
        }

        result.append(data)

    # Rebuild the DataFrame once per account instead of once per tweet.
    df = pd.DataFrame(result, columns=["brand","time(before)","reaction","retweet","like","news_comment","news_url"])

    driver.quit()
bbcworld :  200
nytimes :  200
CNN :  200
FoxNews :  200
ABC :  200
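
The loop above scrolls a fixed ten times with a one-second pause each time. A minimal sketch of one alternative, assuming the timeline increases `document.body.scrollHeight` whenever more tweets load, is to keep scrolling until the height stops changing. `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names introduced for illustration; the only Selenium call the sketch relies on is `driver.execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing or max_rounds is reached.

    Hypothetical helper: `driver` is any object exposing Selenium's
    execute_script(script) method. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the timeline time to load more tweets
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

In the crawler this would be called as `scroll_to_bottom(driver)` in place of the fixed `for count in range(10)` loop. The `max_rounds` cap keeps the behaviour bounded on timelines that never stop loading.
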
df
brand time(before) reaction retweet like news_comment news_url
0 bbcworld 32분 6 19 39 Are French riots a curse or a blessing for Mac... https://twitter.com/BBCWorld/status/1065052681...
1 bbcworld 46분 27 27 49 Trump submits answers to Mueller's Russia inquiry https://twitter.com/BBCWorld/status/1065049143...
2 bbcworld 52분 5 31 43 Letter from Africa: Cremations 'threaten' Zimb... https://twitter.com/BBCWorld/status/1065047458...
3 bbcworld 1시간 55 347 257 Yemen crisis: 85,000 children 'dead from malnu... https://twitter.com/BBCWorld/status/1065040352...
4 bbcworld 2시간 14 51 78 Nedim Yasar: Reformed gangster shot after book... https://twitter.com/BBCWorld/status/1065027537...
5 bbcworld 3시간 27 130 121 E. coli outbreak: Romaine lettuce probed in US... https://twitter.com/BBCWorld/status/1065014096...
6 bbcworld 4시간 54 341 319 Saudi Arabia 'tortured female activists', char... https://twitter.com/BBCWorld/status/1064996231...
7 bbcworld 6시간 77 88 156 Trump Saudi statement: What the president's wo... https://twitter.com/BBCWorld/status/1064975792...
8 bbcworld 10시간 20 187 490 The Finnish forest that may hold the future fo... https://twitter.com/bbcworldservice/status/106...
9 bbcworld 6시간 24 69 100 Cameroon gunmen seize students from school https://twitter.com/BBCWorld/status/1064963175...
10 bbcworld 7시간 10 43 133 Executed Tanzanian hero's grandson takes DNA t... https://twitter.com/BBCWorld/status/1064957892...
11 bbcworld 7시간 33 99 143 Likely next Interpol chief Prokopchuk 'fox in ... https://twitter.com/BBCWorld/status/1064952743...
12 bbcworld 7시간 22 26 97 Proposals to expand the Qatar 2022 World Cup f... https://twitter.com/BBCSport/status/1064950251...
13 bbcworld 7시간 74 54 148 Tekashi 6ix9ine: What the latest charges could... https://twitter.com/BBCWorld/status/1064951870...
14 bbcworld 7시간 214 146 248 Trump defends Saudi Arabia ties despite Khasho... https://twitter.com/BBCWorld/status/1064951544...
15 bbcworld 8시간 27 177 167 Yemen crisis: Inside a camp where children hun... https://twitter.com/BBCWorld/status/1064945916...
16 bbcworld 8시간 26 119 213 Chicago hospital shooting: Doctor, pharmacist ... https://twitter.com/BBCWorld/status/1064944693...
17 bbcworld 8시간 18 28 72 What to look for in #Brexit declaration? https://twitter.com/BBCWorld/status/1064934846...
18 bbcworld 8시간 97 51 133 What is Foreign Secretary Jeremy Hunt doing in... https://twitter.com/BBCNews/status/10649345206...
19 bbcworld 9시간 51 136 176 Yemen crisis: Why is there a war? https://twitter.com/BBCWorld/status/1064928516...
20 bbcworld 9시간 18 74 185 Huge waterspout hits Italy's south-western city https://twitter.com/BBCWorld/status/1064928447...
21 bbcworld 11시간 12 50 231 Manx climate coalition formed to lobby government https://twitter.com/BBCWorld/status/1064889246...
22 bbcworld 12시간 124 723 915 At least 40 people killed and dozens injured i... https://twitter.com/BBCBreaking/status/1064885...
23 bbcworld 12시간 17 86 196 Turkey must free Selahattin Demirtas, European... https://twitter.com/BBCWorld/status/1064878790...
24 bbcworld 13시간 3 19 114 Sligo woman Vera Dwyer in lung transplant record https://twitter.com/BBCWorld/status/1064871710...
25 bbcworld 13시간 56 160 430 Polar bear display stuns Isle of Man shoppers https://twitter.com/BBCWorld/status/1064869321...
26 bbcworld 14시간 4 26 87 Guide to the #InSight lander's mission to Mars... https://twitter.com/BBCNewsGraphics/status/106...
27 bbcworld 13시간 16 67 168 Rome police seize eight Casamonica mafia villas https://twitter.com/BBCWorld/status/1064863405...
28 bbcworld 13시간 212 1,938 1,659 Dead sperm whale found in Indonesia had ingest... https://twitter.com/BBCWorld/status/1064859760...
29 bbcworld 13시간 19 66 120 Italy Aquarius: Prosecutors order migrant resc... https://twitter.com/BBCWorld/status/1064858607...
... ... ... ... ... ... ... ...
970 ABC 11월 19일 22 64 185 No, Democrats didn’t win the Senate. But they ... https://twitter.com/ABC/status/106465279981903...
971 ABC 11월 19일 10 28 79 Russian and U.S. space officials hail their jo... https://twitter.com/ABC/status/106465279943316...
972 ABC 11월 19일 79 525 478 People evacuate a hospital in Chicago, where p... https://twitter.com/ABC/status/106464811709256...
973 ABC 11월 19일 13 83 113 Four people were found "executed" in the basem... https://twitter.com/ABC/status/106464526549014...
974 ABC 11월 19일 10 24 73 Spanish Prime Minister Pedro Sanchez urges gre... https://twitter.com/ABC/status/106464425941078...
975 ABC 11월 19일 50 107 241 "He's literally in my seats!" A resident in No... https://twitter.com/ABC/status/106464261467319...
976 ABC 11월 19일 16 116 114 NEW: Officer shot and in critical condition am... https://twitter.com/ABC/status/106464081993050...
977 ABC 11월 19일 78 140 345 HAPPENING NOW: Former Pres. Barack Obama speak... https://twitter.com/ABCPolitics/status/1064639...
978 ABC 11월 19일 15 26 42 Big technology and internet companies tumbled ... https://twitter.com/ABC/status/106463995716625...
979 ABC 11월 19일 82 524 433 JUST IN: Multiple people injured in shooting n... https://twitter.com/ABC/status/106463873055480...
980 ABC 11월 19일 22 162 506 India's first hospital for elephants recently ... https://twitter.com/ABC/status/106463735559486...
981 ABC 11월 19일 36 45 133 MORE: CNN has since dropped its suit over the ... https://twitter.com/ABC/status/106463610960554...
982 ABC 11월 19일 29 88 74 Authorities in Las Vegas released surveillance... https://twitter.com/ABC/status/106463419060659...
983 ABC 11월 19일 10 37 31 Haitian police say one of their officers has b... https://twitter.com/ABC/status/106462965717215...
984 ABC 11월 19일 12 19 36 A New Jersey couple traveling to their wedding... https://twitter.com/ABC/status/106462801405773...
985 ABC 11월 19일 0 0 0 "Your hard pass is restored," Sarah Sanders an... https://twitter.com/ABC/status/106462469169065...
986 ABC 11월 19일 117 172 303 JUST IN: White House restores CNN reporter Jim... https://twitter.com/ABC/status/106462380126100...
987 ABC 11월 19일 8 39 42 Russian prosecutors announce new criminal case... https://twitter.com/ABC/status/106461958671114...
988 ABC 11월 19일 4 7 18 A senior coalition partner in Israel's governm... https://twitter.com/ABC/status/106461936567128...
989 ABC 11월 19일 12 14 27 Two Academy Awards for best picture are going ... https://twitter.com/ABC/status/106461832506870...
990 ABC 11월 19일 4 24 40 Romanian novelist who thought she had found a ... https://twitter.com/ABC/status/106461784012323...
991 ABC 11월 19일 44 100 284 Rep. Adam Schiff tells @ThisWeekABC Democrats ... https://twitter.com/ABC/status/106461535674721...
992 ABC 11월 19일 25 40 140 Rep. Adam Schiff to @ThisWeekABC on the murder... https://twitter.com/ABC/status/106461460406249...
993 ABC 11월 19일 12 86 127 SURVIVOR: 17-year-old Formula 3 driver Sophia ... https://twitter.com/ABC/status/106461189765429...
994 ABC 11월 19일 15 28 75 MORE: Watts was called a "heartless monster" b... https://twitter.com/ABC/status/106460842415409...
995 ABC 11월 19일 34 33 114 Three Senate Democrats file lawsuit arguing Ac... https://twitter.com/ABC/status/106460764403032...
996 ABC 11월 19일 3 17 61 "I've been trying to wake up out of this night... https://twitter.com/ABC/status/106460608399661...
997 ABC 11월 19일 15 66 99 HAPPENING NOW: @ABC News has the latest on the... https://twitter.com/ABC/status/106460296517493...
998 ABC 11월 19일 17 32 59 With the coldest months of winter fast approac... https://twitter.com/ABC/status/106460197997502...
999 ABC 11월 19일 17 12 38 The White House Correspondents' Association an... https://twitter.com/ABC/status/106459814134381...

1000 rows × 7 columns
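
The `reaction`, `retweet`, and `like` columns hold the text scraped from the page, so values such as `1,938` are strings rather than numbers. A minimal sketch of a converter, assuming counts appear only as plain digits with optional thousands separators (as in the table above), could look like this. `parse_count` is a hypothetical helper, and treating empty strings and `None` as 0 is an assumption.

```python
def parse_count(value):
    """Convert a scraped count such as '1,938' (or the fallback 0) to an int.

    Hypothetical helper; empty strings and None are treated as 0 (assumption).
    """
    if value is None:
        return 0
    text = str(value).replace(",", "").strip()
    return int(text) if text else 0


print(parse_count("1,938"))
```

Applied to the DataFrame, this would be something like `df["retweet"] = df["retweet"].map(parse_count)` for each of the three columns, after which they can be sorted or summed numerically.
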

# Join all tweet texts into one string, skipping tweets whose text was not found (None).
comment_text = ' '.join(df['news_comment'].dropna())
stopwords = ["A",'US','CNN','S','News','New','Sessions','south']

retokenize = RegexpTokenizer(r"[\w]+")  # raw string avoids the invalid-escape warning for \w
tokens = pos_tag(retokenize.tokenize(comment_text))
names_list = [data[0] for data in tokens if data[1] == "NNP" and data[0] not in stopwords]
fd_names = FreqDist(names_list)
fd_names.most_common(20)
[('Trump', 138),
 ('House', 68),
 ('California', 56),
 ('President', 50),
 ('White', 43),
 ('Chicago', 35),
 ('Khashoggi', 31),
 ('Saudi', 25),
 ('Ivanka', 24),
 ('Jamal', 24),
 ('Jeff', 24),
 ('Thanksgiving', 21),
 ('Clinton', 20),
 ('Pres', 20),
 ('POTUS', 19),
 ('Florida', 17),
 ('Former', 16),
 ('Senate', 16),
 ('York', 16),
 ('Mueller', 15)]
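
The filter-and-count step can be sketched without NLTK to show what it does to the tagged tokens. This is only an illustration under stated assumptions: `collections.Counter` stands in for `FreqDist`, `count_proper_nouns` is a hypothetical helper, and the sample below is tagged by hand rather than produced by `pos_tag`.

```python
from collections import Counter


def count_proper_nouns(tagged_tokens, stopwords=()):
    """Count words tagged 'NNP' that are not in stopwords.

    tagged_tokens: iterable of (word, tag) pairs, the shape nltk.pos_tag returns.
    """
    blocked = set(stopwords)
    return Counter(
        word for word, tag in tagged_tokens
        if tag == "NNP" and word not in blocked
    )


# Hand-tagged sample for illustration only (not real pos_tag output).
sample = [("Trump", "NNP"), ("defends", "VBZ"), ("Saudi", "NNP"),
          ("ties", "NNS"), ("US", "NNP"), ("Trump", "NNP")]
counts = count_proper_nouns(sample, stopwords=["US"])
print(counts.most_common(2))  # [('Trump', 2), ('Saudi', 1)]
```

Because each token is counted on its own, multi-word names are split, which is why `White` and `House` appear as separate entries in the list above.
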
wc = WordCloud(width=1000, height=600, background_color="white", random_state=0)
plt.imshow(wc.generate_from_frequencies(fd_names))
plt.axis("off")
plt.show()

[Figure: word cloud of the proper-noun frequencies in fd_names]