Admin Tool
Editing file: clean.cpython-39.pyc
[Compiled CPython 3.9 bytecode; binary data omitted. The recoverable text follows.]

Source file: /usr/lib64/python3.9/site-packages/lxml/html/clean.py

Module docstring:

    A cleanup tool for HTML.

    Removes unwanted tags and content.  See the `Cleaner` class for
    details.

Exported names (__all__): clean_html, clean, Cleaner, autolink, autolink_html, word_break, word_break_html

String constants (regular expressions and XPath expressions):

    expression\s*\(.*?\)
    @\s*import
    </?[a-zA-Z]+|\son[a-zA-Z]+\s*=
    (javascript|jscript|livescript|vbscript|data|about|mocha):
    (xml|svg)
    [\s\x00-\x08\x0B\x0C\x0E-\x19]+
    \[if[\s\n\r]+.*?][\s\n\r]*>
    descendant-or-self::*[@style]
    descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']

Helper function: _has_javascript_scheme(s), which refers to _find_image_dataurls, _is_unsafe_image_type and _possibly_malicious_schemes and keeps a counter named safe_image_urls.

Docstring of class Cleaner:

    Instances cleans the document of each of the possible offending
    elements.  The cleaning is controlled by attributes; you can
    override attributes in a subclass, or set them in the constructor.

    ``scripts``:
        Removes any ``<script>`` tags.

    ``javascript``:
        Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets
        as they could contain Javascript.

    ``comments``:
        Removes any comments.

    ``style``:
        Removes any style tags.

    ``inline_style``
        Removes any style attributes.  Defaults to the value of the ``style`` option.

    ``links``:
        Removes any ``<link>`` tags

    ``meta``:
        Removes any ``<meta>`` tags

    ``page_structure``:
        Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

    ``processing_instructions``:
        Removes any processing instructions.

    ``embedded``:
        Removes any embedded objects (flash, iframes)

    ``frames``:
        Removes any frame-related tags

    ``forms``:
        Removes any form tags

    ``annoying_tags``:
        Tags that aren't *wrong*, but are annoying.  ``<blink>`` and ``<marquee>``

    ``remove_tags``:
        A list of tags to remove.  Only the tags will be removed,
        their content will get pulled up into the parent tag.

    ``kill_tags``:
        A list of tags to kill.  Killing also removes the tag's content,
        i.e. the whole subtree, not just the tag itself.

    ``allow_tags``:
        A list of tags to include (default include all).

    ``remove_unknown_tags``:
        Remove any tags that aren't standard parts of HTML.

    ``safe_attrs_only``:
        If true, only include 'safe' attributes (specifically the list
        from the feedparser HTML sanitisation web site).

    ``safe_attrs``:
        A set of attribute names to override the default list of attributes
        considered 'safe' (when safe_attrs_only=True).

    ``add_nofollow``:
        If true, then any <a> tags will have ``rel="nofollow"`` added to them.

    ``host_whitelist``:
        A list or set of hosts that you can use for embedded content
        (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
        You can also implement/override the method
        ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
        implement more complex rules for what can be embedded.
        Anything that passes this test will be shown, regardless of
        the value of (for instance) ``embedded``.

        Note that this parameter might not work as intended if you do not
        make the links absolute before doing the cleaning.

        Note that you may also need to set ``whitelist_tags``.

    ``whitelist_tags``:
        A set of tags that can be included with ``host_whitelist``.
        The default is ``iframe`` and ``embed``; you may wish to
        include other tags like ``script``, or you may want to
        implement ``allow_embedded_url`` for more control.  Set to None to
        include all tags.

    This modifies the document *in place*.

Constructor (__init__) error messages:

    Unknown parameter: %s=%r
    It does not make sense to pass in both allow_tags and remove_unknown_tags

Option names referenced by the constructor: inline_style, style, allow_tags, remove_unknown_tags.

Link-attribute table: tag names script, link, applet, iframe, embed, layer, a; attribute names src, href, code, object.

Docstring of the cleaning method:

    Cleans the document.
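The constructor leaves enough readable names and messages in the dump (`not_an_attribute`, `inline_style`, `allow_tags`, `remove_unknown_tags`, and the two error strings) to sketch how keyword options appear to be validated. The following is a minimal reconstruction under that assumption, not lxml's source: `MiniCleaner` and its four class-level defaults are hypothetical and exist only for illustration.

```python
class MiniCleaner:
    """Hypothetical stand-in, not lxml's Cleaner: option checks only."""

    # Assumed defaults, chosen only for this illustration.
    style = False
    inline_style = None
    allow_tags = None
    remove_unknown_tags = True

    def __init__(self, **kw):
        not_an_attribute = object()  # sentinel meaning "no such option"
        for name, value in kw.items():
            default = getattr(self, name, not_an_attribute)
            # An option is recognised when its class-level default is
            # None, True, False, or a collection of names.
            is_flag = default is None or default is True or default is False
            is_collection = isinstance(default, (frozenset, set, tuple, list))
            if not (is_flag or is_collection):
                raise TypeError("Unknown parameter: %s=%r" % (name, value))
            setattr(self, name, value)
        # inline_style follows style unless it was passed explicitly.
        if self.inline_style is None and "inline_style" not in kw:
            self.inline_style = self.style
        # An allow-list replaces the "unknown tags" filter; asking for both
        # is rejected as contradictory.
        if kw.get("allow_tags"):
            if kw.get("remove_unknown_tags"):
                raise ValueError(
                    "It does not make sense to pass in both "
                    "allow_tags and remove_unknown_tags")
            self.remove_unknown_tags = False
```

In this sketch, `MiniCleaner(style=True)` ends up with `inline_style` set to `True`, and `MiniCleaner(allow_tags=["p"])` switches `remove_unknown_tags` off. A misspelled option raises `TypeError` because its looked-up default is the sentinel rather than a flag or collection.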
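[Compiled bytecode of the cleaning method; binary data omitted. Recoverable string constants: image, img, on, resolve_base_href, type, text/javascript, /* deleted */, stylesheet, rel, meta, head, html, title, param. The dump is truncated at this point.]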