Extracting text from an HTML file using Python

Question

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I had pasted the browser content into Notepad.
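For the entity part of the problem on its own, a minimal sketch using the Python 3 standard library (the sample string is made up for illustration). This only decodes character references; it does not remove tags or scripts:

```python
import html

# Numeric (&#39;) and named (&lt;, &amp;) references are both decoded.
sample = "It&#39;s &lt;b&gt;bold&lt;/b&gt; &amp; simple"
decoded = html.unescape(sample)
print(decoded)
```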

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not produce exactly plain text; it produces markdown, which would then have to be converted into plain text. It comes with no examples or documentation, but the code looks clean.

184

python html text html-content-extraction

Author: Community, 2008-11-30

29 answers

The best piece of code I found for extracting text without getting JavaScript or other unwanted content:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
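The whitespace clean-up at the end of that snippet does not depend on BeautifulSoup, so it can be tried on a plain string. A small sketch, where the function name tidy and the sample input are made up for illustration:

```python
def tidy(text):
    """Put each phrase on its own line and drop blank lines."""
    # Strip leading and trailing space on every line.
    lines = (line.strip() for line in text.splitlines())
    # Split lines that hold several phrases separated by double spaces.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only the non-empty pieces.
    return "\n".join(chunk for chunk in chunks if chunk)

raw = "  Headline one  Headline two \n\n\n  Body text \n"
print(tidy(raw))
```

Note that the split is on two consecutive spaces, so single spaces inside a phrase are preserved.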

102

Author: PeYoTlL, 2014-07-07 19:18:20

NOTE: NLTK no longer supports the clean_html function.

Original answer below, and an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK.
It works magically.

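import nltk
from urllib.request import urlopen  # Python 3 location of urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# clean_html no longer works in current NLTK releases (see the note above).
raw = nltk.clean_html(html)
print(raw)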

102

Author: Shatu, 2016-10-22 15:27:39

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # Python 3; in Python 2 the module was HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

52

Author: xperroni, 2010-10-21 13:14:38

Here is a version of xperroni's answer that is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial inverse converter from plain text to HTML.

"""
HTML <-> text conversions.
"""
from html.parser import HTMLParser          # Python 3 module names
from html.entities import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so the entity and charref handlers below are called
        HTMLParser.__init__(self, convert_charrefs=False)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = chr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(chr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, and skips javascript and stylesheets.
    """
    parser = _HTMLToText()
    # HTMLParseError no longer exists in Python 3, so there is nothing to catch here.
    parser.feed(html)
    parser.close()
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
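The escaping half of text_to_html can also be handled by the Python 3 standard library. A minimal sketch, assuming only the escaping is needed (no link wrapping); the sample string is made up. One difference from the hand-written table above: html.escape writes an apostrophe as &#x27; rather than &#39;.

```python
import html

text = 'Tom & Jerry say "hi" <loudly>'

# quote=True also escapes double and single quotes, not just &, < and >.
escaped = html.escape(text, quote=True)
print(escaped)

# html.unescape reverses the conversion.
round_trip = html.unescape(escaped)
print(round_trip == text)
```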

13

Author: bit4, 2013-05-07 16:04:21

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

8

Author: GeekTantra, 2009-09-23 03:21:58

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

7

Author: Nuncjo, 2012-11-29 19:28:38

PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of PyParsing in use (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with: you can convert the entities before you run BeautifulSoup.

Good luck

6

Author: PyNEwbie, 2018-08-16 13:18:06

I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

5

Author: Floyd, 2018-05-01 00:26:20

This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import subprocess
import tempfile

# html_source is assumed to hold the page source as a str
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name

proc = subprocess.run(['links', '-dump', fname],
                      stdout=subprocess.PIPE,
                      stderr=subprocess.DEVNULL)
text = proc.stdout.decode()

4

Author: Andrew, 2012-05-18 10:02:26

Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it is not much help in terms of getting rid of javascript and css. You could make a derived class and add methods with names like start_script and end_style (see the python docs for details), but it is hard to do this reliably for malformed html.) Anyway, here is something simple that prints the plain text to the console:

# Python 2 only: htmllib is not available in Python 3
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)
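Since htmllib exists only in Python 2, here is a rough Python 3 counterpart of the same idea built on html.parser. It is a sketch, not a drop-in replacement: the class and function names are made up, and it does not wrap or format text the way the formatter and writer objects did.

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    """Collect text nodes, turning <br> into a newline."""

    def __init__(self):
        # The default convert_charrefs=True decodes entities before handle_data.
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def dump_text(markup):
    parser = TextDumper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(dump_text("hello<br>there"))
```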

4

Author: Mark, 2015-07-19 00:57:25

If you need more speed and less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

4

Author: Anton Shelin, 2016-08-30 11:21:42

Install html2text using

pip install html2text

then

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

4

Author: Pravitha V, 2017-04-05 07:16:30

Beautiful Soup does convert html entities. It is probably your best bet, considering that HTML is often buggy and filled with unicode and html encoding issues. This is the code I use to convert html to raw text:

import copy
import re
import BeautifulSoup  # BeautifulSoup 3 API (Python 2), not bs4
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

3

Author: speedplane, 2012-11-30 08:23:23

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

- Main text of an article
- Main image of the article
- Any YouTube/Vimeo movies embedded in the article
- Meta description
- Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

3

Author: Li Yingjun, 2015-11-25 13:05:59

Another option is to run the html through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
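If the text is wanted as a Python string rather than in a file, the same call can capture the output directly. A sketch under some assumptions: the function name is made up, the browser is expected to accept a -dump option as Lynx does, and a missing executable is reported by returning None.

```python
import shutil
import subprocess

def dump_with_text_browser(html_path, browser="lynx"):
    """Return the rendered text of html_path, or None if the browser is missing."""
    if shutil.which(browser) is None:
        return None
    result = subprocess.run(
        [browser, "-dump", html_path],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling dump_with_text_browser('html_to_convert.html') would then return the same text that the shell command above writes to converted_html.txt.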

2

Author: John Lucas, 2014-08-08 02:29:50

Another non-Python solution, Libre Office:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single line of text (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, Libre Office can be used to convert from all sorts of formats...

2

Author: YakovK, 2015-12-11 04:11:45

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently had to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of it so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

2

Author: spatel4140, 2018-02-18 13:36:16

In a simple way

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text that start with '<' and end with '>' and replaces everything it finds with an empty string.
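A quick check of what that regex leaves behind, using a made-up sample string: entities stay encoded and the contents of script elements survive, which are the two problems raised in the question. Decoding entities afterwards with the standard library fixes only the first.

```python
import html
import re

sample = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"

# The tag-stripping regex: tags go away, everything between them stays.
stripped = re.sub(r'<(.*?)>', '', sample)
print(stripped)

# Entities can be decoded afterwards; the script source is still present.
decoded = html.unescape(stripped)
print(decoded)
```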

1

Author: David Fraga, 2016-06-02 15:04:54

@PeYoTIL's answer, which uses BeautifulSoup and eliminates style and script content, did not work for me. I tried it with decompose instead of extract, but it still did not work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test document embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

1

Author: racitup, 2016-12-06 15:06:19

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It is working for me.

1

Author: rox, 2017-01-16 14:10:24

In Python 3.x you can do it very easily by importing the 'imaplib' and 'email' packages. Although this is an old post, maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, it would be nice to select it as the accepted answer.
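The same walk can be tried without an IMAP connection by building a message in memory. A sketch with a made-up message and a made-up function name; note that this approach only picks out a text/plain part that the sender already included, it does not convert HTML to text.

```python
from email import message_from_bytes
from email.message import EmailMessage

# Build a small multipart/alternative message in memory, as a stand-in
# for the bytes that imap.fetch() would return.
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Hello in plain text.")
msg.add_alternative("<p>Hello in <b>HTML</b>.</p>", subtype="html")
raw_bytes = msg.as_bytes()

def plain_text_body(raw):
    """Return the first text/plain part of a raw message, or None."""
    parsed = message_from_bytes(raw)
    for part in parsed.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undo base64/quoted-printable
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None

body = plain_text_body(raw_bytes)
print(body)
```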

1

Author: Wahib Ul Haq, 2017-05-16 20:25:35

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.

1

Author: u-phoria, 2018-05-07 11:07:18

Here is the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

Hope that helps.

0

Author: troymyname00, 2017-10-25 00:14:19

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good.

0

Author: Vimal Thickvijayan Vims, 2018-04-06 03:14:41

You can extract just the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

0

Author: saigopi, 2018-04-13 11:03:57

The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off job rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

0

Author: 1of7, 2018-05-15 22:54:37

Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res

0

Author: brunql, 2018-07-06 11:36:06

I am achieving it with something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

-1

Author: Waqar Detho, 2016-08-07 17:27:43

score 102 · Accepted Answer

html2text is a Python program that does a pretty good job at this.

102

Author: RexE, 2018-06-30 20:54:46