[API] OpenAI API 가지고 놀기 1탄: TL;DR (긴글 요약)

728x90

최근에 제 블로그의 유입 및 방문 통계를 보니 ChatGPT 포스트의 인기가 독보적인 것을 발견했습니다. 아무래도 많이 회자되고 있는 기술이다 보니 찾으시는 분들도 꽤 있어서 뿌듯한 기분이 듭니다. 게다가 ChatGPT와 관련된 마이크로소프트와 엔비디아도 최근에 크게 상승을 해서 저도 기분이 아주 좋습니다. ~~(구조선이 왔다!!!)~~

그래서 이왕 이렇게 된 김에, OpenAI의 API가 제공하는 여러가지 기능들을 활용하는 방법에 대해 소개를 드리고자 합니다.

이번 포스트에서는 OpenAPI가 제공하는 여러가지 기능 중, 텍스트 요약에 대해 다루겠습니다.

영어로 TL;DR이 무슨 뜻인 줄 아시나요? 바로 "Too Long. Didn't Read"라는 뜻입니다. 너무 길어서 읽지 않았다는 뜻이고 긴 글을 요약할 때 쓰는 말이기도 합니다. 우리나라말로 번역을 해보자면 "3줄 요약" 정도가 있을 것 같네요. 아무튼, OpenAI API를 활용해서 긴 텍스트를 요약해 보겠습니다.

OpenAI API를 사용하기 위해서는 API키를 발급받아야 합니다. 발급 방법 관련해서는 제 이전 포스트를 참고 부탁드립니다:

2023.01.17 - [💻 파이썬] - [API] ChatGPT Python으로 사용해 보기

[API] ChatGPT Python으로 사용해보기

ChatGPT는 출시 직후 각종 인공지능, 데이터, IT 관련 커뮤니티에서 큰 화두가 되었다. 어떠한 알고리즘을 사용했고 학습 데이터를 얼마나 사용했는지는 업계 사람들이 아닌 이상 크게 관심이 없을

jakely.tistory.com

소스코드

아래 소스코드를 활용하시면 요약 기능을 사용할 수 있습니다. 요약을 하고 싶은 글의 맨 마지막 부분에 tl;dr을 추가하면 된다고 합니다.

각종 커뮤니티와 SNS에서 tl;dr이라는 문구를 사용한 많은 글들이 GPT학습이 되면서 가능해진 기능인 것으로 추정됩니다.

아쉽게도 아래 테스트 결과를 보시면 아시겠지만, 한국은 지원되지 않는 것 같습니다. (성능이 처참합니다)

이 요약 기능이 대단한 점은 요약을 사람처럼 해준다는 것입니다. 일반적으로 문서나 핵심문장이라고 한다면 본문 내에서 가장 중요한 문장을 몇 개 가져오는 식이 됩니다. 따라서 다른 문장에 있을 수 있는 중요한 정보들이 누락될 가능성이 있지만, ChatGPT의 경우 스스로 문장을 생성해 내므로 (영어에 한해서는) 완벽에 가까운 요약 성능을 보이고 있습니다.

import os
import openai

# 발급받은 OpenAI API키를 복붙
openai.api_key = 'YOUR_API_KEY'

def tldr(corpus, max_tokens=100):  
    
    if type(corpus) != str:
        raise TypeError
    else:
    	# tl;dr 이 맨 마지막 줄에 있어야 요약이 됩니다
        corpus += '\ntl;dr'
        
        # OpenAI API를 활용해서 ChatGPT사용
        response = openai.Completion.create(    
              model="text-davinci-003",
              prompt=corpus,
              temperature=0.7,
              max_tokens=max_tokens,
              top_p=1.0,
              frequency_penalty=0.0,
              presence_penalty=1
            )

        return response['choices'][0]['text'].strip()
        
        
# 요약할 텍스트 변수로 저장

## https://en.wikipedia.org/wiki/ChatGPT   
#### Training, Features and Limitation 부분

corpus1 = """ChatGPT was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning.[5] Both approaches used human trainers to improve the model's performance. In the case of supervised learning, the model was provided with conversations in which the trainers played both sides: the user and the AI assistant. In the reinforcement step, human trainers first ranked responses that the model had created in a previous conversation. These rankings were used to create 'reward models' that the model was further fine-tuned on using several iterations of Proximal Policy Optimization (PPO).[6][7] Proximal Policy Optimization algorithms present a cost-effective benefit to trust region policy optimization algorithms; they negate many of the computationally expensive operations with faster performance.[8][9] The models were trained in collaboration with Microsoft on their Azure supercomputing infrastructure.
In addition, OpenAI continues to gather data from ChatGPT users that could be used to further train and fine-tune ChatGPT. Users are allowed to upvote or downvote the responses they receive from ChatGPT; upon upvoting or downvoting, they can also fill out a text field with additional feedback.[10][11]
Features and limitations
Conversation with ChatGPT about whether Jimmy Wales was involved in the Tiananmen Square protests, December 30, 2022
Although the core function of a chatbot is to mimic a human conversationalist, ChatGPT is versatile. For example, it can write and debug computer programs,[12] to compose music, teleplays, fairy tales, and student essays; to answer test questions (sometimes, depending on the test, at a level above the average human test-taker);[13] to write poetry and song lyrics;[14] to emulate a Linux system; to simulate an entire chat room; to play games like tic-tac-toe; and to simulate an ATM.[15] ChatGPT's training data includes man pages and information about Internet phenomena and programming languages, such as bulletin board systems and the Python programming language.[15]
In comparison to its predecessor, InstructGPT, ChatGPT attempts to reduce harmful and deceitful responses.[16] In one example, whereas InstructGPT accepts the premise of the prompt "Tell me about when Christopher Columbus came to the U.S. in 2015" as being truthful, ChatGPT acknowledges the counterfactual nature of the question and frames its answer as a hypothetical consideration of what might happen if Columbus came to the U.S. in 2015, using information about Columbus' voyages and facts about the modern world – including modern perceptions of Columbus' actions.[6]
Unlike most chatbots, ChatGPT remembers previous prompts given to it in the same conversation; journalists have suggested that this will allow ChatGPT to be used as a personalized therapist.[17] To prevent offensive outputs from being presented to and produced from ChatGPT, queries are filtered through OpenAI's company-wide moderation API,[18][19] and potentially racist or sexist prompts are dismissed.[6][17]
ChatGPT suffers from multiple limitations. OpenAI acknowledged that ChatGPT "sometimes writes plausible-sounding but incorrect or nonsensical answers".[6] This behavior is common to large language models and is called hallucination.[20] The reward model of ChatGPT, designed around human oversight, can be over-optimized and thus hinder performance, otherwise known as Goodhart's law.[21] ChatGPT has limited knowledge of events that occurred after 2021. According to the BBC, as of December 2022 ChatGPT is not allowed to "express political opinions or engage in political activism".[22] Yet, research suggests that ChatGPT exhibits a pro-environmental, left-libertarian orientation when prompted to take a stance on political statements from two established voting advice applications.[23] In training ChatGPT, human reviewers preferred longer answers, irrespective of actual comprehension or factual content.[6] Training data also suffers from algorithmic bias, which may be revealed when ChatGPT responds to prompts including descriptors of people. In one instance, ChatGPT generated a rap indicating that women and scientists of color were inferior to white and male scientists.
"""

## https://www.mk.co.kr/news/stock/10627563
corpus2 = """
글로벌 소셜네트워크서비스(SNS)인 인스타그램, 페이스북의 모회사 메타 플랫폼(META) 주가가 400억달러(약 49조원) 규모의 자사주 매입 소식에 19% 이상 급등했다. 메타는 디지털 광고 시장 침체로 월가 기대치를 밑도는 실적을 발표했지만 파격적인 주주 환원책이 효과를 발휘했다는 평가다.

1일(현지시간) 메타는 지난해 4분기 매출액으로 321억7000만달러(약 39조2800억원)를 기록했다고 발표했다. 이는 월가가 추정한 315억3000만달러를 웃도는 호실적이다. 반면 월가가 중요하게 보는 수익성 지표인 주당순이익(EPS)은 1.76달러로 시장 기대치(2.26달러)를 크게 밑돌았다. 이로써 메타는 직전 6개 분기 연속 매출액, 주당순이익 중 한 요소가 월가 추정치에 미치지 못하는 실적 충격 기록을 이어가게 됐다.

부진한 실적을 만회하기 위해 메타가 띄운 승부수는 파격적인 주주 환원 정책이다. 이날 메타는 실적을 발표하면서 주가 부양을 위한 400억달러 규모의 자사주 매입에 나서겠다고 발표했다. 1일 기준 메타의 시가총액은 4015억1000만달러 수준이다. 메타 시가총액의 10%에 해당하는 액수만큼 자사주를 사들이겠다는 것이다. 이는 국내 증시에서 삼성전자 시가총액(377조원)의 13%에 해당하는 수치이기도 하다. 메타의 자사주 매입 규모(49조원)보다 시가총액 규모가 큰 국내 종목은 삼성전자, LG에너지솔루션, SK하이닉스, 삼성바이오로직스 뿐이다.

메타의 자사주 매입 발표 후 주가는 시간외 거래에서 19.5% 상승했다. 지난해 11월 기록한 주당 88달러와 비교하면 107% 상승한 수준이다.

그동안 급락했던 메타의 주가가 반등하고 있지만 괄목할만한 실적 성장이 현실화되지 않는다면 추가 상승 동력을 얻긴 어려울 것이란 지적이 나온다. 메타가 신성장 동력으로 삼고 있는 메타버스 사업부문 또한 적자가 누적 중이다.

SNS 플랫폼 사용자 수가 증가하고 있다는 점은 호재다. 메타는 페이스북 등의 일일 활성 사용자수가 전년 대비 5% 증가한 29억6000만명으로 집계됐다고 밝혔다. 지난 넷플릭스 실적 발표 때처럼 플랫폼 사업자의 경우 이익이 정체 돼도 사용자수가 늘게 되면 성장 측면에서 시장은 긍정적으로 평가하는 때문이다.

메타 측은 수익성에 대한 시장의 우려를 알고 있다며 향후 ‘효율성 개선’에 집중하겠다고 강조했다. 마크 저커버그 메타 최고경영자(CEO)는 실적 발표 성명을 통해 “회사 설립 초기 당시 매년 수익이 급격히 증가했지만 2022년 메타는 역사상 처음으로 마이너스 성장을 하는 등 상황이 극적으로 변했다”면서도 “올해 메타의 경영 테마는 효율성이 될 것이며 더 강하고 민첩한 조직이 되는 데 집중하고 있다”고 밝혔다.
"""

테스트 결과

corpus1 은 ChatGPT의 위키피디아 영문 설명 중 Training (훈련) 부분과 Features and Limitation (기능과 한계) 부분을 발췌했습니다. 그리고 corpus2 의 경우에는 매일경제 신문기사를 발췌했습니다.

아래 결과 1을 보시면 영어의 경우에는 완벽에 가까운 요약 성능을 보여줍니다. 글자개수 제한이 100개라 후반에 잘렸지만, 어떤 방식으로 훈련이 되었으며, 어떤 제한점을 가지고 있는지를 전부 설명합니다. 심지어 본문과 똑같은 문장이 전혀 없습니다.

결과 2의 경우에는... 성능이 좋지 않습니다. 한글은 아무래도 학습데이터가 적기도 하고, 사용자가 많지 않아 지원이 되는 것 같지 않습니다. 현재 까지로서 한글 요약의 경우에는 영어로 번역을 하여 요약을 한 뒤, 다시 한글로 번역을 하는 방법밖에 없을 것 같습니다.

TL;DR

ChatGPT로 사람과 같은 문장 요약이 가능합니다.

'💻 IT·기술·통계' 카테고리의 다른 글

[API] 리그오브레전드: 챔피언 말고 API를 가지고 놀아보자 (2)	2023.02.05
[API] OpenAI API 가지고 놀기 2탄: 개체명 인식 (NER) (0)	2023.02.04
Python으로 주사위 게임 만들기 (0)	2023.02.03
[NLP] 정규표현식을 활용한 전처리 및 데이터 추출 (0)	2023.01.29
[웹크롤링 2탄] selenium webdriver를 활용한 상품 리스트 크롤링 (0)	2023.01.29

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

근근로그

[API] OpenAI API 가지고 놀기 1탄: TL;DR (긴글 요약)